The phoneme recognition system was tested using isolated synthesized
words, which permitted evaluation with connected phoneme strings but
stopped short of requiring development of word boundary rules. The tests
consisted of 100 phonemically balanced words containing 281 phonemes.

Of these, 245 phonemes were correctly identified, 23 were misidentified,
13 were missed entirely, and 11 were added. However, many of the
errors were predictable or understandable and may be overcome at a
higher (word or phrase) level. It is firmly believed that with further
research and the addition of some simple phonetic and linguistic
rules this system can be developed into a working natural speech
recognizer that requires only a small computer (or a small part of a
large one), requires relatively small amounts of processing time, and
has the potential of an almost unlimited vocabulary.
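The reported tallies can be turned into rates with a few lines of
arithmetic. The sketch below uses only the four counts given above; the
combined "accuracy" figure follows the common convention of charging
substitutions, deletions, and insertions against the reference count,
which is an assumption and not a measure stated in this work.

```python
# Phoneme tallies as reported in the test summary.
TOTAL = 281
CORRECT, MISIDENTIFIED, MISSED, ADDED = 245, 23, 13, 11

def percent(count, total=TOTAL):
    """Express a count as a percentage of the reference phonemes."""
    return round(100.0 * count / total, 1)

# Every reference phoneme is either correct, misidentified, or missed.
accounted_for = CORRECT + MISIDENTIFIED + MISSED

rates = {
    "correct": percent(CORRECT),
    "misidentified": percent(MISIDENTIFIED),
    "missed": percent(MISSED),
    "added": percent(ADDED),
}

# Assumed convention: accuracy = (N - S - D - I) / N.
accuracy = percent(TOTAL - MISIDENTIFIED - MISSED - ADDED)
```

Under these counts about 87% of the phonemes were identified correctly,
and the stricter combined figure is about 83%.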
[...] (segmentation or compartmentalization). In
addition, rules will be developed for derivation of the basic units
of speech (phonemes) from the results of the audio signal
classification processes. Accomplishing this task will also demonstrate
the AMRL model [illegible] sufficient information through the signal
transformation and feature extraction operations. If successful, the
system might be developed into a useful analog signal pattern
classifier for which automatic speech recognition would be only one
important application.

Basic Terminology

A basic grasp of some of the terms used in the discussion of
speech production and recognition is necessary to understand the research
presented in this thesis.

A phoneme is a basic unit of spoken language. It is the smallest
unit of language which, when exchanged for another such unit, will
change the meaning of a word. Phonemes bear the same relationship to
spoken language as alphabetic characters bear to written language. In
English, 40 to 44 phonemes are generally recognized. Each of these may
be represented by a written symbol, and several such symbol sets are in
use. Table 1 lists two of these symbol sets, the International
Phonetic Alphabet (IPA) and the International Teaching Alphabet
(ITA), as well as the teletype code used to represent the phonemes in
this project. An example word is also listed for each phoneme.
Any one phoneme may be produced as several somewhat different
sounds depending upon [illegible] or dialect. Each such variation of a
phoneme is called an allophone of the phoneme.

When speech is analyzed with a specially designed spectral
analyzer called a spectrograph (Ref. [illegible]), spectral peaks appear.
The spectral peaks are clearly visible in spectrograms (outputs of
the spectrograph) such as the one shown in Figure 1. By
extensive study of the spectrograms of known sounds, speech
scientists have identified certain principal spectral peaks with certain
sounds (Ref. [illegible]). Generally three such peaks are identified in each
sound and are called formants. The formant with the lowest frequency
is referred to as the first formant and lies in the range of [illegible] to
[illegible] Hz. The formant with the next higher frequency is referred to as
the second formant and lies in the range of 700 to [illegible] Hz. The next
higher formant is the third formant and lies in the range of [illegible] to
[illegible] Hz. As is evident from the above, the frequency regions of
adjacent formants overlap. If the sound being analyzed is unknown
to the interpreter of the spectrogram and a peak appears in the
overlap of two regions, a determination has to be made as to whether it is
a principal peak and if so, which of the two possible formants it is.
Currently, the only way to resolve this ambiguity is to know what
phoneme was being produced. The determination of which spectral
peaks are significant when the speech sound is unknown is so
difficult that when highly qualified spectrogram readers were given
spectrograms of English sentences they could not identify the phonemes
which had made up the utterances (Ref. [illegible]). The problem of formant
identification has long been thought to be the key to speech
recognition and several speech recognition systems have been based on
this premise. However, they have met with only moderate success
(References [illegible], [illegible], 11, and [illegible]).

How Speech is Generated

The human vocal tract is a set of variable acoustic elements, under
the control of the speaker. The vocal tract is excited by a periodic
impulse source generated by motion of the vocal cords and/or by a noise
resulting from a continuous turbulence near a constriction in the vocal
tract. Speech sounds result as a modification of the source spectrum
by the acoustic properties of the vocal tract elements. The various
cavities (pharynx, mouth, nasal, between the teeth and lips) are
acoustic elements and the moveable structures (lips, tongue, jaw, and
velum) modify the elements producing different sounds. Sounds produced
by vocal cord excitation are called voiced sounds. Sounds produced by a
noise source are called unvoiced, voiceless, or fricative sounds.

Sounds produced by the combination of vocal cord movement and noise are
called voiced fricatives.

Sounds can be categorized by the method of excitation and by the
site and extent of constriction of the vocal tract. With the vocal
tract open and the vocal cords vibrating, the acoustic elements are
excited by impulses of air released by the vocal cords and continuous
voiced sounds (vowels, diphthongs, and semi-vowels) are produced by
modifying the various acoustic elements. Vowels are sounds in which
the formants approach and remain near some steady-state values. The
diphthongs are sounds which start at one vowel and then proceed to or
[...] or the voiced [illegible]. Moving the lower lip to the teeth causes either
the voiceless F or the voiced V. The special case in which the point of
constriction is the glottis is called aspiration. In this instance the
vocal cords are placed mid-way between the fully open position and the
closed position. There is no vocal cord vibration, but there is
turbulence. Constriction at the glottis is used for the phoneme H and for
whispering voiced sounds.

Complete closure in the mouth and opening of the nasal cavity are
used to generate the nasals. Air passage through the mouth is stopped
by either the tongue or the lips and the velum drops to allow the air
Stops are also formed by a closure in the mouth but the air is not
allowed to pass through the nasal cavity. Thus, the air flow through
the vocal tract is completely stopped. However, the diaphragm continues
to move, causing a pressure buildup behind the blockage. The blockage is
removed suddenly, causing a surge of air. Therefore, a stop can be
characterized by a rapid closure, a short period of silence, and a rapid
release. A stop is either voiced or voiceless depending on the
condition of the vocal cords at the time of closure and release. In a voiced
stop, voicing may precede or accompany the release. In a voiceless stop,
voicing is delayed for 30-40 ms after the release, resulting in a burst
of fricative noise. In stops, as in fricatives and nasals, the phoneme
produced is determined by the closure location and the condition of the
acoustic elements. Closure at the lips is used for either a voiced B
Synthetic Speech Generation

Man has attempted to synthesize speech for centuries. These
attempts have ranged from the use of bellows and levers in early models
to the use of high-speed digital computers and filter systems.
Reference 8 gives a history of speech synthesis. One goal of the research
into speech synthesis was to make machines "talk", but the research was
also to increase understanding of speech production and recognition.

In the last few years interest in speech synthesis has been aroused
speech signal is [illegible], typically 10 ms long; and

they are using natural speech. Formant tracking is the monitoring of
the time evolution of major peaks of the power spectrum of speech.
In order to perform formant tracking, the formants (or at least a
comparable measure) must be extracted from the speech signal. Many
automatic speech recognition systems perform some sort of spectral
analysis in the recognition process but analysis is sometimes done
in terms of autocorrelations of the amplitude variations of the speech
waveform, or in terms of linear predictive codes (Ref. 4), or in
terms of zero crossing statistics (Ref. 11).
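Two of the non-spectral measures named above, zero crossing counts and
short-time autocorrelation, are simple enough to sketch on a single
10 ms segment. The example below is illustrative only: the 10 kHz
sampling rate, the 500 Hz test tone, and the function names are
assumptions and are not taken from any of the referenced systems.

```python
import math

def zero_crossings(frame):
    """Count sign changes between consecutive samples of one segment."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

def autocorrelation(frame, lag):
    """Unnormalized short-time autocorrelation of the segment at one lag."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

# Hypothetical test segment: 10 ms of a 500 Hz sine sampled at 10 kHz.
SAMPLE_RATE_HZ = 10_000
frame = [math.sin(2 * math.pi * 500 * n / SAMPLE_RATE_HZ + 0.1)
         for n in range(100)]

crossing_count = zero_crossings(frame)

# The lag with the largest autocorrelation estimates the period in samples.
best_lag = max(range(5, 50), key=lambda lag: autocorrelation(frame, lag))
estimated_frequency_hz = SAMPLE_RATE_HZ / best_lag
```

A pure tone gives a strong autocorrelation peak at its period (20
samples here), while the zero crossing count gives a coarser frequency
estimate from the same segment.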

All current attempts at continuous speech recognition are
top-down systems and they have not progressed beyond being "laboratory
curiosities." The Advanced Research Projects Agency (ARPA)
has sponsored a five-year speech-understanding project which has
given a great deal of impetus to research in continuous speech
recognition. Involved in this research are such prestigious facilities as
Bolt, Beranek, and Newman; Carnegie-Mellon University; Lincoln
Laboratory (MIT); Stanford Research Institute; Systems Development
Corporation; Haskins Laboratories; Speech Communication Research
Laboratory; Sperry-Rand; and the University of California at Berkeley.
Commercially, Bell Telephone Laboratories, IBM, and Texas Instruments
are also involved in speech recognition.

The second, and most simplistic, category of speech recognition
systems is an isolated word recognizer. These systems operate on
acoustic measures of a signal sample bounded by silence. They treat
any sound preceded and followed by silence as a single pattern; this
pattern may be a word or short phrase.
the audio signal and compared to each candidate in a set of stored
prototypes. The "closest" match is recognized as the word or phrase
for that sample. Such a system is clearly limited to a small
vocabulary since each sample must be tested against all prototypes.
Performance of these systems is determined by the percentage of
correctly identified words or phrases and depends to a large extent
upon the acoustic dissimilarity of the members of the prototype set.
Devices in this category are commercially available (Ref. 19) and are
finding limited applications.
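The prototype comparison described above can be sketched as a
nearest-neighbor search. The feature vectors, the three-word vocabulary,
and the use of Euclidean distance are all hypothetical choices made for
illustration; the text does not specify which acoustic measures or
distance the commercial devices use.

```python
import math

def closest_prototype(sample, prototypes):
    """Return the label whose stored vector is nearest to the sample.

    Every prototype is tested, so cost grows with vocabulary size.
    """
    return min(prototypes,
               key=lambda label: math.dist(sample, prototypes[label]))

# Hypothetical stored prototypes: one feature vector per word.
prototypes = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.2, 0.8, 0.5],
    "stop": [0.5, 0.5, 0.9],
}

recognized = closest_prototype([0.8, 0.2, 0.35], prototypes)
```

Because each sample is compared against the whole set, both the search
time and the chance of confusing acoustically similar entries increase
as prototypes are added, which is the vocabulary limitation noted above.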

The third category of speech recognition is the "bottom-up"
approach. In this approach the audio signal is partitioned into
basic speech units (phonemes). It is generally recognized that there
are only 40-44 phonemes in the language. Therefore, a phonemic-based
system requires only a few prototypes in order to recognize all words,
phrases, and sentences. There are two major problems in developing
such a system: several acoustic representations for a particular
phoneme (allophones) must be considered and the speech signal must be
partitioned into phonemic units. A phoneme recognizer should be
evaluated by the percentage of phonemes correctly identified in
connected speech. Reference 19 is a recent overview of the state of the
art of speech recognition.

Approach 

For this dissertation it was decided to attempt recognition of
speech on a phoneme-by-phoneme basis. This approach was selected
because recognition at the phoneme level requires only a small number
of prototypes for an almost unlimited vocabulary. Further, it was
decided to use synthetic speech rather than natural for development
and testing the recognition system. Synthetic speech eliminates
several problems inherent in natural speech while preserving the
major attributes. The exact configuration of synthetic speech is
known, whereas in natural speech, the actual speech sounds being
analyzed are not precisely known and can be estimated only by
presentation to a panel of trained listeners. The speech synthesis
system which was used was developed by the author (Ref. 17) as an MS(EE)
thesis project sponsored by [illegible]. This system generates continuous
speech from a string of phonemes as an input. Thus, it was possible
to directly compare the output of the phoneme recognizer with the
synthesizer requires an input of [illegible] parameters consisting of
frequencies, bandwidths, and amplitudes for each "sound" that it makes. In
production of speech these sounds are changing continuously, requiring
a new set of parameters every [illegible]-10 ms. The frequencies are
converted to locations for poles or zeros on the unit circle in the
Z-plane. All pole or zero inputs have an associated bandwidth input.
The radial distance inside the unit circle is directly proportional to
the value of the bandwidth.
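One common way to place a digital resonator pole from a frequency and a
bandwidth is sketched below. It is offered as an assumption, not as the
synthesizer's documented mapping: the angle is taken as 2*pi*F*T and the
radius as exp(-pi*B*T), where T is the sampling period. For small
bandwidths the distance inside the unit circle is then approximately
pi*B*T, which agrees with the proportionality stated above. The 10 kHz
sampling rate is also assumed.

```python
import cmath
import math

def pole_location(freq_hz, bandwidth_hz, sample_rate_hz):
    """Map a resonance (frequency, bandwidth) to a Z-plane pole.

    Assumed mapping: angle = 2*pi*F*T, radius = exp(-pi*B*T).
    """
    period_s = 1.0 / sample_rate_hz
    radius = math.exp(-math.pi * bandwidth_hz * period_s)
    angle = 2.0 * math.pi * freq_hz * period_s
    return cmath.rect(radius, angle)

SAMPLE_RATE_HZ = 10_000  # assumed; not given in the text

# Example using a 3500 Hz resonance with a 175 Hz bandwidth.
pole = pole_location(3500.0, 175.0, SAMPLE_RATE_HZ)
distance_inside = 1.0 - abs(pole)
linear_estimate = math.pi * 175.0 / SAMPLE_RATE_HZ
```

Each complex pole is paired with its conjugate in a real filter; only the
upper-half-plane member is computed here.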

The synthesizer has two normally independent paths. The upper or
voiced path includes a pitch impulse generator, a shaping network, a
nasal pole and zero network, and a radiation network. This path is
used for voiced sounds (vowels, nasals, semi-vowels, and voiced stops),
the voiced portion of voiced fricatives, the aspirant H and whispering.
The lower or fricative branch includes a noise generator, a fricative
pole and zero network, and a shaping network. This branch is used for
voiceless fricatives, voiceless stops, and for the unvoiced portion of
voiced fricatives. The two paths are used together for voiced
fricatives. The voice path is driven by the noise generator for the
aspirant H and whispering. The value of the voiced amplitude (A_V) is
the determining factor as to which path is used. A positive A_V triggers
a purely voiced or combinational sound. A zero A_V triggers an unvoiced
sound. A negative A_V triggers aspiration.
The amplitude of the voiced output of the synthesizer is directly
proportional to the value of A_V.
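The role of the voiced amplitude as a selector can be summarized in a
short sketch. The function name and the returned labels are hypothetical;
only the three sign cases come from the description above.

```python
def excitation_mode(voiced_amplitude):
    """Select the excitation from the sign of the voiced amplitude A_V."""
    if voiced_amplitude > 0:
        # Pitch impulses drive the voiced path (alone or with the
        # fricative branch for a combinational sound).
        return "voiced"
    if voiced_amplitude == 0:
        # Only the fricative (noise) branch contributes.
        return "unvoiced"
    # Negative A_V: the noise generator drives the voiced path.
    return "aspiration"

modes = [excitation_mode(a_v) for a_v in (12.0, 0.0, -5.0)]
```

Using one signed parameter in this way avoids a separate control input
for aspiration and whispering.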

The pitch period (PER) is in milliseconds and is the inverse of the
pitch. This input causes an impulse to be produced at intervals PER
apart.

Conceptually the shaping network shapes the impulse into a form
resembling the volume velocity waveform produced by motion of the vocal
cords and the radiation network simulates the radiation impedance at the
lips. However, the shaping network and the radiation network are
combined and are represented by two poles on the real axis of the Z-plane.

The three lowest formants (F1, F2, F3) are the crux of the
synthesis strategy and are amply explored in Appendix A. The fourth
formant (F4) is set to 3500 Hz and the bandwidths of the four formants
are set to [illegible], 100, 120, and 175 Hz and remain constant for all
phonemes except nasals. The bandwidth of [illegible] is broadened to 150 Hz
for a nasal to simulate the natural damping of the nasal cavity.

The nasal pole (NPOL) and zero (NZER) are only used for nasals.
During a non-nasal sound they are both set to 1400 Hz and effectively
cancel each other. Just prior to a nasal, NPOL, NZER, and their
bandwidths are moved linearly with time to their target values. Just after
the nasal they are moved linearly back to the steady state values.

Voiceless Path. The lower or voiceless path in Fig. [illegible] is used for
the production of voiceless fricatives. The output of the friction
generator is allowed to pass into the branch by setting A_V to zero and
A_F to a positive value. The magnitude of the unvoiced output of the
synthesizer is directly proportional to the value of A_F. The parameter
[name illegible] is used to control the duration of the sound. The two
poles (FPOL1, FPOL2) and the zero (FZER) in this branch control the
spectral shapes of the noise produced. The "type" of noise produced is
an essential characteristic of the fricative being simulated.

Paths in Combination. The two paths are used in combination for
ROC COC Filter

The ROC COC Filter (short for cochlea) models the sound
transformations of both the middle and inner ears. It uses a band-pass
filter to simulate the middle ear and a unique electronic transmission
line to simulate the hydro-mechanical functions of the inner
ear of the physical system. The middle ear section band-pass filter is
centered at [illegible] Hz with [illegible] dB/octave skirts (Fig. [illegible]).

This filter was designed to fit experimental data (Ref. [illegible]).

According to the designers (Ref. 16:17) the cochlea portion of the
ROC COC Filter is ". . . a transmission line with characteristics that
vary in a systematic manner along the length of the line. The
propagation velocity of a wave traveling along the line changes systematically
as a function of distance from the input to termination, becoming ever
slower as the wave progresses. In addition, the attenuation
characteristic of the line is designed so that high frequencies are attenuated
nearest the input while lower frequencies propagate further along the
line before attenuation. Both propagation velocity and attenuation
frequency vary logarithmically as a function of distance and are
related to each other in such a way that a constant number of cycles
of a sinusoidal signal are stored in the line between input and the
location where the signal is attenuated 40 db. This constant cycle
storage is independent of frequency input to the line. By proper
manipulation of the design parameters it is possible to design transmission
lines with different storage capacities. The storage of a fixed number
of cycles in the transmission line, independent of the input frequency,
is the feature that distinguishes the class of COC filters from all
others. In this type of transmission line we are not trying to obtain
the input signal unmodified and delayed in time from various taps along
the line. Rather we are interested in observing the modification of
the signal that takes place as it propagates along the line."

Experimental data (Ref. [illegible]) show that the physical cochlea stores
between 1.5 and 2.0 cycles of the input signal. ROC COC is designed
to store 1.75 cycles. Further, the ROC COC Filter is designed to have
no reflection by having it properly terminated.
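A direct consequence of constant cycle storage can be worked out
numerically. If a fixed number of cycles N lies between the input and
the point of attenuation, the travel time to that point is N periods,
or N/f, for a sinusoid of frequency f. The sketch below assumes this
reading of the designers' description; the test frequencies are
arbitrary.

```python
STORED_CYCLES = 1.75  # design value stated for the ROC COC Filter

def storage_time_ms(freq_hz, cycles=STORED_CYCLES):
    """Time occupied by the stored cycles of a sinusoid, in milliseconds."""
    return 1000.0 * cycles / freq_hz

# Arbitrary example frequencies spanning part of the speech band.
delays_ms = {f: storage_time_ms(f) for f in (250.0, 1000.0, 3500.0)}
```

Lower frequencies therefore remain in the line longer, which is
consistent with their propagating further before being attenuated.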

The amplitude of the speech signal into the ROC COC Filter must be
controlled so that it does not exceed the dynamic range of the
instrumentation. This control is done by hand to preserve the amplitude
fluctuations in normal speech, which are approximately [illegible] dB.

Inside the ROC COC there is a frequency-dependent amplitude
envelope in which the signal is contained. Figure 5 on page 75 shows
two signals of different frequencies "frozen in time" to demonstrate
the amplitude envelope. As the signal passes down the line, it slowly
increases to a peak in amplitude. After it passes the peak, it is
rapidly attenuated. The position of the amplitude peak in the line
is a logarithmic function of frequency with the higher frequencies
peaking first and the lower frequencies peaking later.
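A logarithmic place-frequency relation of the kind described can be
sketched as follows. The band edges of 8000 Hz and 125 Hz and the
normalization of position to the range 0 (input) to 1 (termination) are
hypothetical; the actual design constants of the filter are not given
in this section.

```python
import math

def peak_position(freq_hz, f_high_hz=8000.0, f_low_hz=125.0):
    """Fractional distance along the line at which a tone peaks.

    Assumes position is linear in log-frequency between the two
    (hypothetical) band edges: 0.0 at the input, 1.0 at termination.
    """
    return math.log(f_high_hz / freq_hz) / math.log(f_high_hz / f_low_hz)

positions = {f: peak_position(f) for f in (4000.0, 2000.0, 1000.0, 500.0)}
```

With such a mapping each octave occupies the same length of line, so
equally spaced taps sample frequency on a logarithmic scale.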

In the physical cochlea the detector and nerve cells are arrayed
along the mechanical line and are so numerous that it can be considered
a continuous sampling. Because continuous sampling is impossible to
achieve in an electronic model, the ROC COC Filter was designed with
45 taps as a reasonable compromise. It must be kept in mind that it is
the modified signals at the various taps that are of interest and the
sampling of these signals is the function of the CxC Computer.

CxC Computer

The CxC Computer is a unique piece of hardware that models the
logic responses of the nervous system. The computer was designed and
developed based on hypotheses that have been experimentally verified
but not proven (Ref. [illegible]). CxC is made up of multiples of three basic
[...] hardwired (or more correctly, hand-patched) together.
These components are the syncoders, the synapse buttons, and the
sample and hold circuits. Using the three components, a model of an
individual neuron or a group of neurons can be achieved. Although
not a physical part of CxC, the hardware interface and the PDP-11
digital computer used for data collection are necessary for the
operation of CxC.

Syncoders. The basic logic element of the CxC Computer is the
syncoder. Functionally the syncoder is a leaky integrator and summing
junction followed by a comparator. CxC is a unique type of computer
because each syncoder is voltage controllable and its transfer function
is signal dependent. The syncoder performs a leaky integration and
summation of all inputs. The result is compared to an exponentially
decaying threshold and when the two are equal a pulse of given value
and duration is generated as an output and the exponential threshold
is re-initiated.

Experimentation has shown that this threshold in a real neuron
approximates an exponential decay and that the time constant of the
decay is a random variable. The model that best fits the experimental
data is one in which a new time constant is randomly selected each time
the exponential threshold is re-initiated. Once a time constant is
selected, it is not changed until the threshold is again re-initiated.
However, producing CxC with such random syncoder elements was not
technically nor economically feasible. Therefore, AMRL designers
decided to make the syncoders deterministic by fixing the time constant
of each given unit. However, different syncoders may be set with
different time constants.
"positive refractory period" for the duration of the output pulse
length. That is, the threshold goes to infinity and the neuron
(syncoder) cannot be fired regardless of the input. Thus, the response
to a DC level input is a periodic string of pulses. The response to
a time varying signal is complex and depends upon integration time,
threshold decay constant, and refractory time of the syncoder, all of
which are adjustable on each syncoder. These parameters are adjusted
according to the use of the particular syncoder in the network. The
syncoders that are used as detectors on the 45 taps of the ROC COC
Filter are set so that they will fire on each peak of the highest
frequency that can reach that particular tap at maximum input
amplitude. Because the syncoders continuously compare input to the
threshold they are obviously amplitude dependent.
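The syncoder behavior described above can be approximated by a small
discrete-time simulation. This is a sketch under stated assumptions and
not the CxC circuit: the integrator is modeled as a first-order lag, the
integrator is not reset on firing, the threshold restarts from its
initial value when the output pulse ends, and all numerical values are
hypothetical.

```python
import math

def syncoder_fire_times(inputs, dt, tau_int, tau_th, th0, pulse_len):
    """Simulate one deterministic syncoder; return the firing times (s).

    inputs    : summed input level at each time step
    dt        : time step in seconds
    tau_int   : leaky-integrator time constant
    tau_th    : threshold decay time constant (fixed per unit)
    th0       : threshold value at re-initiation
    pulse_len : output pulse length, also the refractory period
    """
    level = 0.0              # leaky-integrator output
    threshold_start = 0.0    # time at which the threshold was re-initiated
    refractory_until = -1.0  # end of the current output pulse
    fire_times = []
    for step, x in enumerate(inputs):
        t = step * dt
        level += (dt / tau_int) * (x - level)  # leaky integration
        if t < refractory_until:
            continue  # threshold is effectively infinite during the pulse
        threshold = th0 * math.exp(-(t - threshold_start) / tau_th)
        if level >= threshold:
            fire_times.append(t)
            refractory_until = t + pulse_len
            threshold_start = t + pulse_len  # re-initiate after the pulse
    return fire_times

# Hypothetical settings: a DC input of 1.0 applied for 50 ms.
DT = 1e-5
fires = syncoder_fire_times([1.0] * 5000, DT, tau_int=1e-3,
                            tau_th=2e-3, th0=5.0, pulse_len=2e-4)
intervals = [b - a for a, b in zip(fires, fires[1:])]
```

Once the integrator has settled, the interval between pulses approaches
pulse_len + tau_th * ln(th0 / input), so a DC input yields the periodic
pulse string noted above and a larger input yields a higher pulse rate.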

Synapse Buttons. A synapse button is connected to the pulse
output port of a syncoder. It is basically a switch that conducts
when the pulse output of the syncoder is high. The outputs of these
switches are normally connected to the integrating inputs of other
syncoders. Therefore, when a syncoder fires, the switch conducts and
a voltage applied to one side of the switch appears at the integrator
input of the receiving syncoder. A voltage is produced at the syncoder
summing junction and the voltage exponentially increases while the
switch is closed and immediately begins an exponential decay when the
switch opens. Pulses can easily be weighted or assigned relative
significance by controlling the voltages to the synapse buttons. Each
pulse output can "fan out" to eight inputs.

Sample and Hold. The sample and hold (s&h) circuits supply the
voltage sources required by the synapse buttons and DC levels that are
added at the summing junctions to bias the various time-varying signals.
These circuits are controlled by a PDP-8/S digital computer. The
PDP-8/S addresses each s&h board individually and supplies a
predetermined voltage to that s&h board through a digital to analog converter.
The PDP-8/S requires less than two minutes to sequentially address all
1728 s&h boards in CxC. The voltage on a s&h board after the required
two minutes is about 97% of the original voltage.
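The refresh figures above imply a per-board service time and a droop
rate, which can be estimated as follows. The exponential droop model is
an assumption made for illustration; the text gives only the 97% value
at two minutes.

```python
import math

BOARD_COUNT = 1728
REFRESH_TIME_S = 120.0     # upper bound: "less than two minutes"
REMAINING_FRACTION = 0.97  # voltage left after one refresh cycle

# Upper bound on the time the PDP-8/S spends addressing one board.
per_board_ms = 1000.0 * REFRESH_TIME_S / BOARD_COUNT

# Assumed exponential droop: V(t) = V0 * exp(-t / tau).
droop_time_constant_s = -REFRESH_TIME_S / math.log(REMAINING_FRACTION)

def remaining_fraction(elapsed_s):
    """Fraction of the stored voltage left after elapsed_s seconds."""
    return math.exp(-elapsed_s / droop_time_constant_s)
```

Under this model each board is serviced in under about 70 ms and the
hold time constant is a little over an hour, so a 3% droop per cycle is
the worst case seen by any synapse button.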

Hardware Interface. The Asynchronous Pulse Pattern Processor
(ASPPP) is the hardware interface used to sample up to 32 pulse outputs
from CxC and store the results in a PDP-11/20 digital computer. Each
five microseconds the ASPPP looks for up to two rising edges of pulses
on the 32 channels starting with the first channel output of CxC. If
it finds at least one, it records the channel(s) and the time since the
last pulse was recorded on any channel. Although the ASPPP can only
record the first two pulses (in channel order) in a five microsecond
time section, this has proven to not be a cause of significant loss of
data. In an informal inspection of speech data it was found that two
channels had fired "simultaneously" less than 5% of the time.
Therefore, the amount of data lost due to a third simultaneous firing must
be extremely small.
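The sampling rule of the ASPPP can be sketched in software. The record
layout, the function name, and the measurement of elapsed time in whole
five-microsecond sections are assumptions; only the window length, the
two-pulse limit, and the channel-order priority come from the
description above.

```python
def asppp_record(edges, window_us=5, max_per_window=2):
    """Reduce rising edges to ASPPP-style records.

    edges : iterable of (time_us, channel) rising edges
    Returns a list of (elapsed_us, channels), where elapsed_us is the
    time since the previous record and channels holds at most
    max_per_window entries, lowest channel numbers first.
    """
    by_window = {}
    for time_us, channel in edges:
        by_window.setdefault(time_us // window_us, []).append(channel)

    records = []
    last_window = 0
    for window in sorted(by_window):
        kept = sorted(by_window[window])[:max_per_window]
        records.append(((window - last_window) * window_us, kept))
        last_window = window
    return records

# Hypothetical edges: three channels fire within the first section.
edges = [(3, 7), (4, 2), (4, 30), (12, 5), (31, 9)]
records = asppp_record(edges)
```

In this example the edge on channel 30 is the only one lost, because it
is the third firing inside a single five-microsecond section.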

Programs. The ASPPP presents the data received from CxC to the
PDP-11 computer but computer programs are needed to accept the data and
control the sampling. AMRL has several such programs, one of which is
used extensively in this project. This program starts data collection
when a switch is manually depressed and stops data collection when the
They are plots of the pulses on the [illegible] pulse output channels of CxC
versus time. The location of the pulses in time is an indication of
when the pulses occurred. In these figures the high frequency
components of the sound are in the lower portion of each plot and the lower
frequency components are seen at the top. A square wave has high
frequency components at the rising and falling edges and lower frequency
components throughout. Thus, in the plots of the square waves a long
chain of pulses is apparent at the leading edge and a short string of
high frequency component pulses is apparent at the falling edge. One
can also note that the lower frequencies propagate further down the
ROC COC and also did not begin firing channels as soon as the higher
frequencies did. In both the square and sine wave figures, the changing
velocity of the waves is also readily apparent as a curvature in the
pulse patterns. If the velocities of the signals had remained constant
the pulse patterns would have approximated a sloped, straight line.

Figures [illegible] through [illegible] on pages [illegible] through
[illegible] present the outputs of CxC for several natural and synthetic
speech sounds. Once again, the high frequency components of the sounds
are seen in the lower portion of each plot and the lower frequency
components are at the top. It can be seen that the synthetic patterns
compare with their natural

was believed that there must be a way of recognizing what sounds were
being made either on a pitch period basis (for voiced sounds) or on a
small sample basis (for unvoiced sounds). This recognition is the subject
of Chapters III and IV.

CxC Pulse Output for 1000 Hz Sine Wave

CxC Pulse Output for 1000 Hz Square

Fig. 18. CxC Pulse Output for Natural ZZ

CxC Pulse Output for Synthetic ZZ

CxC Pulse Output for Natural FF



III.  Segment  Identification


In  this  chapter  the  methods  of  recognizing  small  segments  of 
synthesized  speech  from  the  data  collected  from  CxC  are  presented.  The 
segments  for  voiced  speech  are  delineated  by  pitch-period  marker  pulses. 
For  a male  speaker  or  for  the  output  of  the  speech  synthesizer  used 
here,  these  pulses  are  from  six  to  ten  milliseconds  apart.  The  pitch 
period  marker  allows  the  natural  periodicity  of  voiced  speech  to  be  used 
for segmentation. When the pitch-period marker pulses are absent,
indicating voiceless speech, data is analyzed in ten millisecond time
segments. Silent periods (no pulses on any channel) are analyzed as a
single unit regardless of their length. Analysis of each segment is
based  on  the  time  between  pulses  on  each  channel  and  the  number  of  times 
each  channel  fired. 

Initial  Manipulations

After  data  from  CxC  is  retrieved  from  disk  storage,  two  initial 
manipulations are performed on each segment. The first of these
manipulations is a pulse interval determination and the second is a channel
firing  statistic. 

Pulse Interval Determination. There are 30 channels of data output
from CxC. The data on each channel are pulses of constant amplitude and
duration produced by a syncoder operating in a specific network. The
ASPPP records the time of the rising edge of a pulse and the channel on
which it occurred. The most obvious data manipulation is to determine
the time between pulses on a particular channel. The duration of this
"pulse interval" is limited to between 0.01 ms (100,000 Hz) and 4.80 ms
(208 Hz). These limits were chosen based on known speech frequencies
and on an analysis of the range of intervals in the data from CxC. Each
pulse interval considered is rounded off to the next lower 0.01 ms
increment and recorded in a linear matrix of 480 elements. Pulse
pattern determination is done without regard to the channel on which
the pulses occurred. For each segment a histogram of the number of
occurrences of each pulse interval is generated. The histogram is
normalized so that there are a total of 300 pulse interval occurrences
in each histogram. Typical histograms for a single pitch period of IY
and AA and a ten millisecond segment of SS are displayed in Figures 34
through 36 on pages 53 through 54.
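
The histogram construction described above can be sketched in a few
lines of Python. This is only an illustration and not the program used
in this project: the function name, the representation of a segment as a
list of (time in microseconds, channel) records, and the use of integer
microseconds to avoid rounding trouble are all assumptions made for the
example.

```python
def pulse_interval_histogram(events):
    """Build the 480 element pulse interval matrix for one segment.

    events: list of (time_us, channel) rising-edge records (assumed format).
    Element k-1 counts intervals that round down to k * 0.01 ms.
    """
    last_seen = {}          # most recent pulse time on each channel
    hist = [0.0] * 480
    for time_us, channel in sorted(events):
        if channel in last_seen:
            dt = time_us - last_seen[channel]
            # keep only intervals between 0.01 ms and 4.80 ms
            if 10 <= dt <= 4800:
                hist[dt // 10 - 1] += 1   # round down to a 0.01 ms step
        last_seen[channel] = time_us
    count = sum(hist)
    if count:
        # normalize so every histogram holds 300 occurrences in total
        hist = [h * 300.0 / count for h in hist]
    return hist
```

Intervals from all channels are pooled into the one histogram, so the
channel number is used only to pair each pulse with its predecessor.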

Channel Firing Statistic. The second initial process is to determine
how many times each of the 30 data channels fired (produced pulses)
within the current segment. The result is stored in a linear matrix of
30 elements.
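
A minimal sketch of the channel firing statistic follows. The record
format (time in microseconds, channel) and the numbering of channels
from 0 to 29 are assumptions made for the example only.

```python
def channel_firing_counts(events, n_channels=30):
    """Count how many pulses each data channel produced in one segment.

    events: list of (time_us, channel) records; channels are assumed to be
    numbered 0 .. n_channels-1 for this illustration.
    """
    counts = [0] * n_channels
    for _time_us, channel in events:
        counts[channel] += 1
    return counts
```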

Speech  Categorization

For convenience, speech is divided into three categories and two
special cases. The first category is steady-state speech; that is,
sounds which occur at or very near steady-state values for either
several pitch periods (voiced sounds) or for an extended length of time
(voiceless sounds). The second category is dynamic speech, which is
characterized by rapid changes in the speech patterns. The six stops
(B, D, G, P, T, and K) are the sole members of this division. The third
category is the aspirant H which is a unique sound in American speech.
The two special cases are a stop in utterance initial position and a
stop in utterance final position. These are special because the initial
"shut down" portion of the stop will be missing when the stop is in the
utterance initial position and the final "release" portion of the stop
may be missing when it is in the utterance final position.

The parameters and characteristics used for recognition of
synthesized speech are also present in natural speech. Although natural
speech  is  more  complex  and  less  consistent,  there  is  reason  to  believe 
that  the  methods  presented  here  will  make  an  excellent  starting  point 
for  the  recognition  of  natural  speech. 

Steady-State  Speech 

Steady-state speech, in which sounds approach and remain near some
"steady-state" patterns, includes the ten vowels, two of the four
semi-vowels (LL and RR), three nasals, four voiced and four voiceless
fricatives as shown in Table I on page 3. (The other two semi-vowels, YY
and WW, start near a particular vowel, IY for YY and OO for WW, and
glide toward the following sound.) In all these cases the sounds are
treated in an identical manner; the only differentiation between voiced
and voiceless is in the manner of segmentation: pitch period for voiced
sounds; 10 ms segments for voiceless.

Each segment is examined and identified independently of the
preceding and succeeding segments. Final identification is made for each
segment based on the results of three independent classification
procedures. The first classification is based on three moments which
are calculated from the 480 element pulse interval matrix. (A fourth
moment is used in partitioning the speech signal into phonemes as
discussed in Chapter IV.) The second classification is a "pattern match"
of the pulse interval matrix with similar matrices from the 25 masters
(known steady-state sounds). The third classification procedure is a
pattern match of the 30 element channel firing matrix with similar
matrices of the masters. The results of these three methods are
combined to determine the most likely candidate for each segment.

Moments. Four different moments are calculated from the data in
the pulse interval matrix. The first of these (referred to here as the
raw moment) is a standard first moment about 1.0 ms (1000 Hz). Any
pulse interval occurrences of less than 1.0 ms are considered negative,
and any pulse interval occurrences greater than 1.0 ms are considered
positive. Thus, a sound with a high incidence of short intervals
(high frequency) would have a large negative raw moment and a sound
with a high incidence of long intervals would have a large positive
raw moment. The raw moment is used in partitioning speech into
phonemes.
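
The raw moment can be illustrated with a short sketch. The text does
not give the units of the moment, so the choice made here (occurrence
count times distance from the 1.0 ms element, measured in 0.01 ms steps)
is an assumption, as are the function and parameter names.

```python
def raw_moment(hist, pivot_bin=100):
    """First moment of the pulse interval histogram about 1.0 ms.

    hist[i] holds the occurrences of interval (i + 1) * 0.01 ms, so the
    1.0 ms pivot is interval number 100. Shorter intervals contribute
    negative terms and longer intervals contribute positive terms.
    """
    return sum(n * ((i + 1) - pivot_bin) for i, n in enumerate(hist))
```

A histogram concentrated at short intervals therefore yields a large
negative value and one concentrated at long intervals a large positive
value, as the text describes.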

For the other three moments, the pulse interval matrix is divided
into three overlapping sections which correspond, more or less, to the
frequency regions of the first three formants. These sections are 0.01
to [illegible] ms, [illegible] to [illegible] ms, and [illegible] to
4.80 ms. A standard first moment is calculated about the short interval
(high frequency) end of each of these sections. Each segment is scored
against similar measures from the reference sounds by calculating the
Euclidean Distance (square root of the sum of the squares of the
differences) between the three

rank for each master is added across the three methods and the sound
candidate with the lowest total ranking is selected as the most likely
candidate for that segment.
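
The combination of the three methods by ranking might be sketched as
follows. Because part of the original description is missing, the
details here are assumptions: each method is taken to produce one score
per master, masters are ranked within each method (rank 1 is best), and
the ranks are summed. All names are hypothetical.

```python
import math

def euclidean(a, b):
    """Square root of the sum of the squares of the differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ranks(scores, best_is_low):
    """Map each master name to its rank (1 = best) for one method."""
    order = sorted(scores, key=scores.get, reverse=not best_is_low)
    return {name: r for r, name in enumerate(order, start=1)}

def classify(moment_dist, interval_corr, channel_corr):
    """Pick the master with the lowest rank total across three methods.

    moment_dist: distance per master (smaller is better).
    interval_corr, channel_corr: correlation per master (larger is better).
    """
    total = {}
    methods = ((moment_dist, True), (interval_corr, False), (channel_corr, False))
    for scores, best_is_low in methods:
        for name, r in ranks(scores, best_is_low).items():
            total[name] = total.get(name, 0) + r
    return min(total, key=total.get)
```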

Stops  Internal  to  the  Utterance

It was quickly discovered that the classification methods used for
steady-state speech did not work for stops. One complication was that
sounds adjacent to a stop affect the characteristics of the stop. This
problem was solved by using up to six variations of each stop as
reference patterns. However, an additional problem was observed which
was not as easily solved.

The CxC Computer is somewhat sensitive to amplitude and the signal
amplitude drops rapidly and increases rapidly during a stop.
Consequently, there are few pulses in the low amplitude segments that
are of great interest in stops and, when the number of pulses is
extremely low, the pulse interval matrix correlation scores and the
moments of the pulse interval matrix are more susceptible to minor
variations of the input signal. This susceptibility caused the
correlation and moments of the pulse interval matrix to be unsuitable
measures for stops. Although the correlation of the channel firing
matrix did prove to be viable, another method of classification had to
be found that was less susceptible to minor signal variations than the
moments and correlation of the pulse interval matrix.

A method of classification that is less affected by signal
variations is to divide the pulse interval matrix into several
overlapping sections or "windows." The dimensions of the windows were
selected by studying a composite histogram for segments in the shut down
and release of several stop variations and selecting the null or low
points in the histogram. Thus, the peaks in the histograms of the
segments used are contained in one or more windows. The limits of the
windows are 0.01 to 0.30 ms, 0.30 to 0.48 ms, 0.43 to 0.74 ms, 0.64 to
0.95 ms, 0.90 to 1.20 ms, 1.10 to 1.55 ms, 1.50 to 1.85 ms, 1.83 to
2.13 ms, 2.11 to 2.40 ms, 2.30 to 2.70 ms, 2.60 to 3.10 ms, 3.00 to
3.80 ms, 3.75 to 4.37 ms, and 4.34 to 4.80 ms. For each segment in
question, the number of pulse interval occurrences that fall into each
window is determined and the result is correlated against standards for
all reference stop variations. Reference patterns for both shut down
and release segments of several variations of each stop are stored,
giving a possibility of up to 72 total reference patterns for the six
stops. Currently, only 30 different reference patterns are being used.
These patterns are from the last shut down or first release segment of
a particular stop in which the number of pulses in the segment exceeded
28. One pattern for each of the shut down and release of each of the six
stops was stored. These 12 patterns were tested against various sound
combinations and when a problem arose an additional pattern was stored.
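
The window function can be sketched as below. The window limits are
given in 0.01 ms steps as read from the scanned text, and several of the
digits are uncertain. The text does not state which correlation formula
was used; the normalized (cosine) form shown here is an assumption, as
are all names.

```python
import math

# Window limits in units of 0.01 ms, inclusive (as read from the text).
WINDOWS = [(1, 30), (30, 48), (43, 74), (64, 95), (90, 120),
           (110, 155), (150, 185), (183, 213), (211, 240), (230, 270),
           (260, 310), (300, 380), (375, 437), (434, 480)]

def window_counts(hist):
    """Sum the pulse interval occurrences falling in each window.

    hist[i] holds the occurrences of interval (i + 1) * 0.01 ms.
    Overlapping windows count shared intervals more than once.
    """
    return [sum(hist[lo - 1:hi]) for lo, hi in WINDOWS]

def correlation(a, b):
    """Normalized correlation of two equal-length patterns (assumed form)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0
```

A segment would be scored by calling `correlation(window_counts(hist),
master)` once for every stored reference pattern.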

Unfortunately, even with the use of multiple masters for each stop
the correlations of the channel firings and the window functions, in and
of themselves, were not sufficient to identify the stops. The patterns
for D and G, for example, are very similar to vowels and they correlate
very well with the vowels. For synthetic speech, the "correlation
factor" for a vowel against a master for D could be as high as
[illegible]. A master pattern for B, on the other hand, may correlate
against a vowel with a result as low as [illegible]. Therefore, if a
vowel followed by a B is input to the system, in order to correctly
identify the B the correlations against the B masters would have to rise
dramatically while the correlations against the D masters would have to
fall dramatically.

For an example, say a D master gives an average correlation of 0.80
during the vowel and then falls to a low of 0.65 during the stop.
Further, say the best B master gives an average correlation of 0.10
during the vowel and then rises to a high of 0.60 during the stop.
Figure 27 on page 61 presents just such an example of window function
correlations against a master for each of B, D, and G for a
vowel-stop-vowel combination. The figure is a plot of correlation
against time for the three masters. Only three stop variations are used
for clarity. Even at first glance it is fairly obvious that the stop
being analyzed is a B. But it is necessary for the system to
automatically make the same determination and simply taking the highest
result for any segment is not sufficient. Therefore, a method had to be
incorporated that discriminates the rises or falls of the correlation
results.

One such method is to select a point during the preceding sound
and use it as a baseline for measuring the rise or fall of the
correlations. It was decided to use as a measure the ratio of the
increase of the correlations for the current segment versus the maximum
possible increase above the baseline. If the baseline for a stop master
is 0.60 then the correlations can rise, at most, 0.40. If the
correlations rise 0.20, then the result of the discrimination function
will be 0.20/0.40, or 0.50. The correlation rose one-half of the
distance possible. The baselines should be selected during the stable
portion (no transitions going on) of the sounds preceding or succeeding
the stop. The segments used for discrimination are selected by the
partitioning algorithm as discussed in Chapter IV.

For purposes of an example, say the points circled in Figure 27
on page 61 are selected as the baselines for the three stops. If the
following discrimination function is used

$$V_{di} = \frac{V_i - V_{si}}{1.0 - V_{si}} \qquad (2)$$

where: $V_{di}$ = the discriminated result for the window function of
the ith stop variation

$V_{si}$ = the baseline for the ith stop variation

$V_i$ = the result of the correlation of the current segment of
the incoming signal against the ith stop variation

for each segment of Figure 27 on page 61 the result is as pictured in
Figure 28 on page 63. Again it is obvious that a B was being analyzed
but this time simply taking the highest correlation result is
sufficient. The same procedure is used for the channel firing
correlation.
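
The discrimination function is simple enough to check numerically. The
sketch below uses hypothetical names and reproduces the worked figures
from the text: a baseline of 0.60 with a rise of 0.20 gives 0.50, and in
the D versus B example the falling D master becomes negative while the
rising B master becomes positive.

```python
def discriminate(correlation, baseline):
    """Rise of a correlation above its baseline as a fraction of the
    maximum possible rise (1.0 - baseline). Negative when it falls."""
    if baseline >= 1.0:
        raise ValueError("baseline must be below 1.0")
    return (correlation - baseline) / (1.0 - baseline)

# D master: 0.80 during the vowel, 0.65 during the stop.
d_result = discriminate(0.65, 0.80)
# B master: 0.10 during the vowel, 0.60 during the stop.
b_result = discriminate(0.60, 0.10)
```

After discrimination the B result exceeds the D result, even though the
raw D correlation (0.65) is higher than the raw B correlation (0.60).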

In  the  above  example,  the  masters  used  for  the  shut  down  of  the 
stop  could  have  also  been  adequate  for  the  release  of  the  stop  but  this 
is  not  normally  the  case.  Normally  the  shut  down  and  release  of  a stop 
differ  greatly  and  different  masters  are  required  for  each.  Therefore, 
for a shut down of a stop only the segments just prior to the silent
period are examined and for the release of a stop only the segments just
after the silent period are examined. For the shut down the baselines
from the preceding sound are used and for the release the baselines from
the  succeeding  sound  are  used. 

Each stop may have several reference patterns for the shut down
and several reference patterns for the release and there are two
correlation results for each shut down pattern (window function and
channel firing) and two correlation results for each release reference
pattern. The highest discriminated result for shut down window
function, shut down channel firing, release window function, and
release channel firing (total of four) for each stop are added and the
stop with the highest sum is selected as the most likely candidate.
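
The selection rule above can be sketched as follows. The data layout (a
dictionary of discriminated results per stop and per measure, one value
per reference pattern) and the key names are assumptions made for the
example.

```python
KINDS = ("shut_window", "shut_channel", "release_window", "release_channel")

def select_stop(results):
    """Return (best_stop, sums).

    results[stop][kind] is a list of discriminated results, one per
    reference pattern of that stop. The highest value of each of the four
    kinds is taken and the four maxima are added.
    """
    sums = {stop: sum(max(by_kind[k]) for k in KINDS)
            for stop, by_kind in results.items()}
    best = max(sums, key=sums.get)
    return best, sums
```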

Aspirant  (H)

The aspirant (H) may only be formed in conjunction with a
succeeding voiced sound which is always a vowel or W in American English.
The aspirant is formed by placing the vocal tract into position for the
succeeding sound and exciting the vocal tract with noise generated by
turbulent air flow through half open vocal cords. This whispered
portion of the vowel or W lasts for approximately 100 ms before voicing
begins without readjustment of the vocal tract. The synthesis system
used here recognizes ten basic vowels. Therefore, it can be considered
that there are 11 different aspirants. Reference patterns for the 11
sounds are loaded and used as if they were steady-state sounds with one
major exception: if one of the aspirants is determined to be the most
likely candidate for a particular segment, the most likely non-aspirant
is also recorded. The reason for this exception and the way the
alternate candidate is used will be discussed in Chapter IV.

Special  Cases

The two special cases are a stop in the utterance initial position
and a stop in the utterance final position. These are special cases
because both the shut down and release portions of the stop may not be
present.


The initial portion of an utterance (beginning of sample or after
a pause) is treated as if it were the release portion of a stop internal
to an utterance. Discriminated correlations against the release
reference patterns of the stop variations are performed but the sum of
the highest results for at least one of the stops must exceed a
threshold. This threshold was more or less arbitrarily set at 1.5 and
some experimentation with natural speech should be performed to possibly
determine a more suitable value. If the threshold is exceeded, the stop
with the highest sum of discriminated correlation results is considered
recognized. If the threshold is not exceeded, it is assumed that no stop
is present.

A stop in the utterance final position can be formed either with
or without a release portion. Frequently a speaker may add a release
portion formed with a low level UH to the end of the utterance.
However, the release portion is usually very low level and rudimentary.
The second method of forming a stop in this position is to terminate the
utterance with the closure of the stop. In either case, the release
portion of the stop is not available for identification. Therefore, as
in stops in the utterance initial position, the final portion of an
utterance is treated as part of a stop. Discriminated correlations
against the shut down reference patterns of the stop variations are
performed but the sum of the highest results for at least one of the
stops must exceed a threshold. Again the threshold was set at 1.5. If
the threshold is exceeded, the stop with the highest sum of
discriminated correlation results is considered recognized. If the
threshold is not exceeded it is assumed that no stop is present.


IV.  Partitioning  and  Phoneme  Identification 


Partitioning an unknown speech sample into usable size pieces is
a significant problem for any type of automatic speech recognition. In
many automatic speech recognition attempts the analog signal is
partitioned into word or phrase size lengths by using silence before and
after to demarcate the boundaries. Usually the words or phrases are
not recognized as sequences of phonemes, but rather the entire length
of signal is treated as a single pattern. This system, on the other
hand, individually identifies short analog segments only a few
milliseconds in length. These segments are either naturally demarcated
by the source or are demarcated by a 10 ms time interval. A voiced
speech segment is the result of a single impulsive type excitation of
the vocal tract. In either case, the segments are sub-units of phonemes,
which are the basic units of speech, and it is necessary to partition
and group the sequence of identified segments into phonemic units. The
partitioning scheme is based on measures which reflect changes in the
speech signal.

Particular points in the speech signal are selected as baselines
(starting points) for each of the measures and the changes in the
succeeding segments are measured against the baselines. When the
distance from the baseline of one of the measures exceeds a threshold,
it is considered that a change in the speech signal has been encountered
and the current segment becomes the baseline for that measure. When two
or more measures indicate a change in the speech signal within three
segments of one another, it is considered that a phonemic transition is
taking place, a partition boundary is indicated, and a phoneme is
identified.
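
The agreement rule (two or more measures changing within three segments
of one another) might be sketched as follows. The representation of each
measure as a list of segment indices at which it reported a change, and
the choice to report the later of the two agreeing segments, are
assumptions made for the example.

```python
def boundary_conditions(change_lists, span=3):
    """Return the segment indices at which boundary conditions are met.

    change_lists: one list of segment indices per measure, giving the
    segments at which that measure reported a change. A condition is met
    at segment i when a different measure changed at i or at most `span`
    segments earlier.
    """
    events = sorted((i, m) for m, idxs in enumerate(change_lists) for i in idxs)
    met = set()
    for i, m in events:
        for j, other in events:
            if other != m and 0 <= i - j <= span:
                met.add(i)
    return sorted(met)
```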

Individual  Partition  Measures

Three independent measures are used in this partitioning scheme.
They are pulse interval matrix correlations against IY, AA, and OO; the
raw moment of the input signal; and the overall input signal amplitude.

The first partition measure is calculated from correlations of the
pulse interval matrix of the incoming speech signal against the pulse
interval matrices of the masters for IY, AA, and OO. These three
vowels were selected because they are generally considered to represent
the three corners of the vowel space and most transitions from one
phoneme to another will cause a change in the result of the correlation
of the incoming signal with at least one of these vowels. The
correlation scores with each of these vowels are averaged over three
segments in order to "smooth" the parameter and thus filter out most
variations within a phoneme. The actual measure is the difference
between the current average and a baseline. The results of the
correlations for the second segment encountered in an utterance are used
as a baseline for the first partition. The Euclidean Distance is
calculated from the baseline to the current average for each new
segment. When the resulting value exceeds 1.5, it is considered that a
change in the speech signal has been encountered and the baseline for
this measure is moved to the current average. The process is then
repeated.
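
A sketch of this first measure is given below. It assumes that each
segment supplies three correlation scores (against IY, AA, and OO) and
that the three-segment average covers the current segment and the two
before it. The scale of the correlation scores is not stated in the
text, so the threshold is left as a parameter; the names are
hypothetical.

```python
import math

def corner_vowel_changes(corrs, threshold=1.5):
    """Return the segment indices at which this measure reports a change.

    corrs: list of (c_IY, c_AA, c_OO) tuples, one per segment.
    The baseline starts as the scores of the second segment and moves to
    the current three-segment average each time a change is reported.
    """
    changes = []
    if len(corrs) < 2:
        return changes
    baseline = list(corrs[1])
    for i in range(2, len(corrs)):
        recent = corrs[i - 2:i + 1]                     # three segments
        avg = [sum(col) / 3 for col in zip(*recent)]    # smoothed scores
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(avg, baseline)))
        if dist > threshold:
            changes.append(i)
            baseline = avg
    return changes
```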


The second partition measure is based on the raw moment of the
incoming signal, as calculated in Chapter III. The raw moment of the
current segment is averaged with the raw moments of the preceding two
segments for smoothing. The measure is the difference between the
current average and a baseline. Again the baseline is originally the
second segment of the utterance. The baseline is subtracted from the
current raw moment average. When the absolute value of the result
exceeds 3000 it is considered that a change in the speech signal has
been encountered and the baseline for this measure is changed to the
current average. The process is repeated.
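
The raw moment measure follows the same baseline pattern with a single
number per segment. The sketch below is an illustration under the
assumption that the average covers the current segment and the two
before it; the function name is hypothetical.

```python
def scalar_changes(values, threshold):
    """Return the segment indices at which a scalar measure reports a change.

    values: one number per segment (for example the raw moment).
    The baseline starts as the value of the second segment and moves to
    the current three-segment average each time a change is reported.
    """
    changes = []
    if len(values) < 2:
        return changes
    baseline = values[1]
    for i in range(2, len(values)):
        avg = sum(values[i - 2:i + 1]) / 3
        if abs(avg - baseline) > threshold:
            changes.append(i)
            baseline = avg
    return changes
```

The same routine could serve for any measure that yields one number per
segment, with only the threshold changed.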

The third partition measure is based on the overall input signal
amplitude. Pulses on the amplitude indicator channel (second CxC
channel) occur at a rate logarithmically proportional to the overall
signal amplitude. The actual calculations are based on the time between
the last amplitude marker pulse encountered and the one previous to it.
If no amplitude marker pulses are encountered (amplitude marker pulse
interval is longer than the segment) within a segment, the amplitude
value of the last segment is carried over to the new segment. This
measure is also averaged to help filter out local perturbances and again
the baseline is originally the second segment of the utterance. It is
considered that a change in the speech signal has occurred when the
absolute value of the difference between the current amplitude average
and the amplitude baseline exceeds 1.5, which corresponds to
approximately 4 dB. Again, the baseline is moved to the current average
and the process repeated.


Partition  Boundaries

When two or more partition measures indicate a change in the speech
signal within three segments of one another a partition is considered
complete and a partition boundary is indicated. However, during
phoneme-to-phoneme transitions speech patterns may change enough within
a few segments that boundary conditions may be met several times within
a single transition. To prevent multiple boundary markers in such a
situation, the partitioning algorithm does not permit two boundary
markers to occur within four segments. Further, should the boundary
conditions be met within four segments of the last time they were met,
not only is a boundary not marked but the boundary marker is inhibited
for four more segments. At initial start-up the boundary marker is
inhibited for five segments to allow the system to settle down.
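
The inhibition rule can be sketched as a small loop. The interpretation
used here is an assumption: each time the boundary conditions are met,
whether or not a marker is placed, the following four segments are
blocked, and the first five segments of the utterance are blocked at
start-up. The names are hypothetical.

```python
def mark_boundaries(condition_met, hold=4, startup=5):
    """Return the segment indices at which a boundary marker is placed.

    condition_met: one boolean per segment, True where two or more
    partition measures agree that the signal has changed.
    """
    boundaries = []
    blocked_until = startup        # first index at which a marker is allowed
    for i, met in enumerate(condition_met):
        if not met:
            continue
        if i >= blocked_until:
            boundaries.append(i)
        # Marked or not, inhibit the marker for the next `hold` segments.
        blocked_until = max(blocked_until, i + hold + 1)
    return boundaries
```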

Phoneme  Identification

Once  the  partition  boundaries  are  determined,  the  steady-state 
phonemes  are  ranked  by  the  number  of  times  they  were  recognized  at  the 
segment  level  within  that  partition.  The  phoneme  which  occurred  most 
often  is  identified  as  the  partition  phoneme.  If  more  segments  of  H 
were  recognized  than  any  other  steady-state  phoneme,  H is  identified, 
but  the  second  most  likely  phoneme  is  also  recorded.  To  preclude  false 
identification  of  a phoneme  during  a transition,  the  phoneme  identified 
for  a partition  must  have  occurred  at  least  three  times.  Regardless  of 
whether a steady-state phoneme is identified or not, each time a
partition boundary is indicated the tally of phonemes recognized at the
segment  level  is  restarted. 
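
The tally within one partition might be sketched as follows. The label
"H" for every aspirant and the use of the next most frequent label as
the "second most likely phoneme" are assumptions made for the example.

```python
from collections import Counter

def identify_partition(labels, min_count=3):
    """Return (phoneme, alternate) for one partition.

    labels: the steady-state phoneme recognized for each segment.
    The most frequent label wins if it occurred at least `min_count`
    times. When the winner is H, the runner-up is returned as alternate.
    """
    tally = Counter(labels).most_common()
    if not tally or tally[0][1] < min_count:
        return None, None          # nothing reliable in this partition
    winner = tally[0][0]
    alternate = None
    if winner == "H" and len(tally) > 1:
        alternate = tally[1][0]
    return winner, alternate
```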

When a steady-state phoneme is identified with a particular
partition, two checks are made before it is accepted into the final
phoneme string output. First, if the phoneme is the same as the last
phoneme, a spurious boundary is assumed and the current phoneme is
ignored. Second, if the previous phoneme was an H and the current
phoneme is not a vowel or W, the H is replaced by the second most likely
phoneme for the previous partition.

Combinational  Sounds

The phoneme string is also examined to allow the recognition of

made up of two phonemes. This category includes the diphthongs and the
affricates.

The diphthongs (EI, AI, OI, OU, and AU) are generally thought of
as two vowels in tandem. In both natural and synthetic speech the
generator starts at or near the targets for the first vowel and migrates
toward the targets of the second vowel. Natural speakers do not
always reach the second targets. This tendency was also built into
the speech synthesis system that was used in this exercise. In EI,
AI, and OI the second target is IY but the speaker (real or synthetic)
may only reach its closest neighbor, II. Therefore, the first sounds
(EE, AA, and OW) in combination with II or IY must be considered
complete diphthongs.

Till'  affricat.’s  (Oil  .iiui  .0  are  sy  n t h.'s  i r.'.l  by  comb  i n i ii;.',  T ami  Sll 
for  t'b,  .111.1  !)  ,111.1  hll  for  .1.  In  this  r.'c.ipn  i t i .ni  syst.'m  ..'henevi'r  th.';;.' 
.'.’mb  i ii.i  t i lUis  an'  .'iiciiiint  .'r.'.l  , th.'  app  r.'pr  i .i  t .'  .illricat.'  is  i.i.'iit  i t i .'.1 . 
i'll.'  .1 1 I r i c.i  t .ir.'  .nu'  area  in  wh  i cli  t lu'  r.'c  .’j'.n  i t i .ni  .'1  svnthet  ic  sp.'.'.h 
iiav  . litter  pte.it  ly  fr.'m  th.it  ot  n.itnr.il  sp.'i'ch.  b.'.-.ius.'  t In-  attricil.'s 
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Thi’  phouomi'  stiiii)',  is  i’x;imi  iii'J  it  t iu’  ciiimmiI  I'tiiMiiv’n'  is 

possihlv  Uii’  si'CiMut  phoiu'nu'  ol  a diphtlionp,  i)r  a t t r i ca  t i' , ilia  [iri'V  i on-; 
phonomo  is  cliockint  to  soo  il  it  is  tin-  first  part.  II  it  is.  tin’  two 
aro  ri'plaoi’il  hv  tho  .ippri'pr  i a 1 1'  liiphthonp,  or  altricati'.  Also,  it  tin- 
ourr.’nl  phoiiono  is  thi-  soroiui  part  oi  a Ji  plu  luni;.;  or  allricato  an.i  tho 
provious  iihoiiomo  is  that  dii'htlioip^  or  attrioatc,  tho  oiirn'iit  phoiK’ino  is 
ipnorod . 
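The string-level checks in this section can be summarized in a short sketch. It is a reading of the rules above under several assumptions, not the original implementation: the function and table names are hypothetical, the two-letter codes follow the word-list tables (HH, WW, TT, DD, JJ), the vowel set is taken from the vowel and diphthong rows of Table IV, and the order in which the checks are applied is a guess. Only the pairs the text spells out (EE, AA, or OW followed by II or IY, and the two affricates) are included.

```python
VOWELS = {"IY", "II", "EE", "AE", "AA", "UH", "UU", "OO", "OW", "ER",
          "EI", "AI", "OI", "OU", "AU"}

# (first part, second part) -> combined sound, as listed in the text.
PAIRS = {
    ("EE", "II"): "EI", ("EE", "IY"): "EI",
    ("AA", "II"): "AI", ("AA", "IY"): "AI",
    ("OW", "II"): "OI", ("OW", "IY"): "OI",
    ("TT", "SH"): "CH", ("DD", "ZH"): "JJ",
}

# Combined sound -> set of phonemes that may trail it and must be dropped.
SECOND_PARTS = {}
for (_first, second), whole in PAIRS.items():
    SECOND_PARTS.setdefault(whole, set()).add(second)


def accept(out, phoneme, runner_up_of_last=None):
    """Apply the checks to one newly identified phoneme; mutates `out`."""
    # Check 1: a repeat of the last phoneme means a spurious boundary.
    if out and out[-1] == phoneme:
        return out
    # Check 2: an H not followed by a vowel or W is replaced by the
    # runner-up recorded for its partition (left alone if none exists;
    # a replacement that then duplicates `phoneme` is not handled here).
    if out and out[-1] == "HH" and phoneme not in VOWELS and phoneme != "WW":
        if runner_up_of_last is not None:
            out[-1] = runner_up_of_last
    if out:
        # Two-phoneme sounds: merge first part + second part.
        whole = PAIRS.get((out[-1], phoneme))
        if whole is not None:
            out[-1] = whole
            return out
        # A trailing second part after an already merged sound is ignored.
        if phoneme in SECOND_PARTS.get(out[-1], ()):
            return out
    out.append(phoneme)
    return out
```

Feeding EE, II, IY through `accept` leaves a single EI in the output, since the II completes the diphthong and the later IY is treated as the tail of the same sound.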

Stop Discrimination

As noted in Chapter III, the values of the correlations against the various stop masters must be discriminated. Discrimination is done by normalizing the correlation results during a stop to the correlation results for segments that are assumed to be part of the stabilized portion of the preceding and succeeding sounds. It is assumed, when boundary conditions have not been met for four segments (the boundary marker is no longer inhibited), that the signal parameters have more or less stabilized and the current segment can be used for normalizing the stop correlations.

When signal parameters are considered stabilized for the first time in an utterance, a test is made to see if the utterance began with a stop. The stop correlations for the release portion of the various stop reference patterns are normalized and the highest correlations (window functions and channel firings) for each stop are added. If the sum for any of these exceeds the threshold (1.[illegible]), a stop is assumed to be present and is identified as the stop with the highest of these sums. Whether a stop is recognized or not, the shut-down portion correlations are recorded in case the next phoneme is determined to be a stop and they are needed for normalization.

If a silent period in excess of 35 ms is detected during a partition internal to the utterance, it is assumed that a stop is present, but the system continues until the signal parameters are assumed to have stabilized in the next partition. If a stop is considered to be present, the shut-down portions of the stop correlations are normalized by the recorded values from the last partition and the release portions of the stop correlations are normalized by the results of the correlations against the current segment. The four correlations (shut-down and release of both window functions and channel firing) for each stop variation are summed. For a stop internal to the utterance there is no threshold requirement and the stop with the highest sum is identified. If, on the other hand, a stop is not considered to be present, a stop is not identified. Either way, the current correlations against the shut-down portions of the various stop masters are recorded in case the next phoneme is determined to be a stop.

When the end of an utterance is encountered, a test is made to see if it ended with a stop. The shut-down portion reference correlations are normalized by the values recorded from the last partition and are added for each stop variation. If any of these sums exceeds the threshold (1.[illegible]), a stop is considered to be present and is identified as the stop with the highest such sum.
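The three stop tests (initial, internal, final) share one decision pattern: normalize, sum per stop, then take the largest sum, with a threshold only at the utterance edges. The sketch below illustrates that pattern and nothing more. The function names are hypothetical, normalization is assumed to be a simple ratio against the stabilized-segment correlations (the text does not say how it is computed), reference values are assumed to be non-zero, and the threshold is left as a parameter because its value cannot be read from this copy.

```python
def summed_scores(raw, reference):
    """Sum the normalized correlations for each stop.

    raw[stop] and reference[stop] are equal-length lists: the
    correlations measured during the suspected stop and the matching
    correlations from a stabilized neighbouring segment. Division is an
    assumed form of normalization; reference values must be non-zero.
    """
    return {
        stop: sum(r / ref for r, ref in zip(values, reference[stop]))
        for stop, values in raw.items()
    }


def pick_stop(sums, threshold=None):
    """Return the stop with the highest sum, or None.

    threshold=None models a stop internal to the utterance, where the
    text imposes no threshold. For initial and final stops the best sum
    must exceed the threshold.
    """
    if not sums:
        return None
    best = max(sums, key=sums.get)
    if threshold is not None and sums[best] <= threshold:
        return None
    return best
```

In use, an initial-stop test would call `pick_stop(summed_scores(raw, reference), threshold=...)` with the release-portion correlations, while the internal-stop test would pass the four-term sums and no threshold.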


V. Evaluation, Results and Recommendations

Evaluation

The purpose of this dissertation was to produce a system that would accept the acoustic output of a particular speech synthesis system and produce an accurate written representation of the input. In all cases, the parameters or characteristics used in the recognition of the synthetic speech are believed to also be present in natural speech. Occasionally some natural speech analysis was performed along with the analysis of the synthetic speech. However, evaluation of system performance during development was done on isolated synthetic phonemes whenever possible. The overall accuracy of the recognition of synthetic steady-state phonemes in isolation (unconnected) was excellent. The system did make occasional errors on individual segments but rarely mis-identified or missed a steady-state phoneme. Obviously, development and evaluation of the stops (B, D, G, P, T, and K) and the aspirant (H) had to be done in combination with other phonemes because these phonemes cannot occur in isolation and because the adjacent sounds are known to affect the characteristics of these sounds. The system accuracy on stops and H in "isolation" was very good.

Phonemes rarely occur in isolation in speech; more often they occur in connected sequences to form words and phrases. Testing of overall system performance was performed on isolated words, which permitted evaluation of the phoneme-based recognition system with connected phoneme strings but stopped short of requiring development of word boundary rules. The word lists used in the tests were developed by the Central Institute for the Deaf (CID) and are phonemically balanced lists. The frequency of occurrence of the various phonemes in each list approximates the frequency of occurrence in American speech.

Results

Two considerations that were used in analyzing the results of the system tests on the CID word lists should be noted before discussion of the results. First, YY (as in you) is a sound that starts with a short IY (as in be) and then glides toward the next sound. A separate YY is necessary in speech synthesis but is almost impossible to distinguish from a short IY in speech recognition. Therefore, YY was deleted as a possible candidate and a recognized IY for a YY was considered correct. The second consideration was that the difference between an RR (as in run) and an ER (as in her) is so small that they can almost be considered a single phoneme. Therefore, recognition of one for the other, or a sequence of one and then the other, was considered correct.

The output of the system is a segment-by-segment printout and a printout of the final phoneme string after the data for the utterance is fully processed. A typical system output is shown in the figure on page 75.

The segment-by-segment printing is one line containing the most likely candidate, the partition marker, the partition inhibit value, and the raw moment. A partition boundary is indicated by setting the partition marker to one. The partition marker is inhibited as long as the partition inhibit value is greater than or equal to zero. The raw moment is included merely as a gross indication of the stability of the input speech signal. If an HH is identified for a particular segment, the second most likely candidate is printed and an HH is printed on the next line. The system also outputs a printout of the highest correlation of the window functions and channel firings for each stop when a stop is, or may be, present. The correlations are printed when the partition inhibit value reaches zero for the first time (initial stop possible), when it reaches zero after a silent period greater than 35 ms has been found (internal stop), and when the end of data is reached (final stop possible). After the data for an utterance is processed, the system outputs the final phoneme string formed as a result of the analysis of the utterance.

The method of system evaluation was to compare system output with known inputs, namely, the phonemic input of the CID word lists to the synthesizer. Tables II and III on pages 77 through 80 show the words used, the phonemic spelling used, and the system output for the two word lists used. System errors are underlined. There were a total of 281 phonemes input, of which 245 were correctly identified, 23 were mis-identified, 13 were missed entirely, and 11 were added (Table IV on pages 81 and 82). The sum of the mis-identified, missing and added, divided by the total input, gives a simple error rate of 16.7%. However, many of the errors are predictable or understandable and may be overcome at a higher (word or phrase) level. Figures 33 through 61, ending on page 122, present the segment-by-segment printouts of all words which contained errors.
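The simple error rate quoted above can be checked directly from the four totals. The snippet below is only an arithmetic check; the function name is illustrative and the counts are the ones reported in the text.

```python
def simple_error_rate(total, misidentified, missed, added):
    """Mis-identified + missed + added phonemes, divided by total input."""
    return (misidentified + missed + added) / total


rate = simple_error_rate(total=281, misidentified=23, missed=13, added=11)
print(f"{rate:.1%}")  # 47 / 281, printed as a percentage to one decimal
```

Note that added phonemes are counted against the system even though they are not part of the 281 input phonemes, so the correct, mis-identified, and missed counts (245 + 23 + 13) account for the whole input on their own.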

Analysis

Some errors in phoneme identification occurred even though the sequence of segment identifications was likely correct. These errors are involved with the trajectories (movement) of the speech through the
TABLE II

CID Phonemically Balanced Word List One

No.  Word     Phonemic Spelling   System Output
 1.  ace      EISS                EISS
 2.  ache     EIKK                EIKK
 3.  an       AENN                AENN
 4.  as       AEZZ                AEZZ
 5.  battle   BBAETTLL            BBAETTLL
 6.  bells    BBEELLZZ            BBEELLZZ
 7.  carve    KKAARRVV            AARRTH
 8.  chew     CHOO                CHOO
 9.  could    KKUUDD              KKUUDD
10.  dad      DDAEDD              AEDD
11.  day      DDEI                DDEI
12.  deaf     DDEEFF              DDEEFF
13.  earn     ERNN                ERNN
14.  east     IYSSTT              IYSSTT
15.  felt     FFEELLTT            FFEE[illegible]
16.  give     GGIIVV              GGIIVV
17.  high     HHAI                HHAI
18.  him      HHIIMM              IIMM
19.  hunt     HHUHNNTT            HHUHNNTT
20.  isle     AILL                AILL
21.  it       IITT                IITT
22.  jam      JJAEMM              [illegible]
23.  knees    NNIYZZ              NNIYZZ
24.  law      LLOW                LLOW
25.  low      LLOU                LLOU


TABLE II (Con't)

CID Phonemically Balanced Word List One (Con't)

No.  Word     Phonemic Spelling   System Output
26.  me       MMIY                MMIY
27.  mew      MMYYOO              MMIYRROO
28.  none     NNUHNN              NNRRNN
29.  not      NNAATT              NNAATT
30.  or       OURR                HHOORR
31.  owl      AUWWLL              [illegible]
32.  poor     PPOURR              PPOURR
33.  ran      RRAENN              DDUUEENN
34.  see      SSIY                SSIY
35.  she      SHIY                SHIY
36.  skin     SSKKIINN            SSKKIINN
37.  stove    SSTTOUVV            SSTTOUVV
38.  them     TEEEMM              TEEEMM
39.  there    TEEERR              TEEERR
40.  thing    THIINNGG            THIINNGG
41.  toe      TTOU                TTOU
42.  true     TTRROO              TTRROO
43.  twins    TTWWIINNSS          TTWWIINNSS
44.  up       UHPP                UHPP
45.  us       UHSS                UHSS
46.  wet      WWEETT              [illegible]
47.  what     HHWWUHTT            HHUUTT
48.  wire     WWAIRR              HHAIRRDD
49.  yard     YYUHRRDD            IYRRER
50.  you      YYOO                IYOO
TABLE III

CID Phonemically Balanced Word List Two

No.  Word     Phonemic Spelling   System Output
 1.  ail      EILL                EILL
 2.  air      EERR                EERR
 3.  and      EENNDD              EENNDD
 4.  been     BBIINN              BBIINN
 5.  by       BBAI                BBAI
 6.  cap      KKAEPP              KKAEPP
 7.  cars     KKAARRSS            KKAARRSS
 8.  chest    CHEESSTT            SHIIEESSTT
 9.  die      DDAI                DDAI
10.  does     DDUHZZ              DDUHEE
11.  dumb     DDUHMM              DDUHMM
12.  ease     IYZZ                IYSS
13.  eat      IYTT                IYTT
14.  else     EELLSS              EELLSS
15.  flat     FFLLAETT            FFLLAETT
16.  gave     GGEIVV              IIEIVV
17.  ham      HHAEMM              IIKAEMM
18.  hit      HHIITT              IITT
19.  hurt     HHERTT              HHERRRTT
20.  ice      AISS                AISS
21.  ill      IILL                IILL
22.  jaw      JJOW                DDHHUUOW
23.  key      KKIY                KKIY
24.  knee     NNIY                NNIY
25.  live     LLIIVV              LLIIVV
TABLE III (Con't)

CID Phonemically Balanced Word List Two (Con't)

No.  Word     Phonemic Spelling   System Output
26.  move     MMOOVV              MMOOVV
27.  new      NNOO                NNOO
28.  now      NNAU                NNAU
29.  oak      OUKK                OU
30.  odd      AADD                AADD
31.  off      OWFF                OWFF
32.  one      WWUHNN              WWUHNN
33.  own      OUNN                OUNN
34.  pew      PPYYOO              TTVVOO
35.  rooms    RROOMMSS            [illegible]
36.  send     SSEENNDD            SSEENNDD
37.  show     SHOU                SSUUOU
38.  smart    SSMMAARRTT          SSMMAARR[illegible]TT
39.  star     SSTTAARR            SSTTAARR
40.  tear     TTIIRR              TTIIRR
41.  that     TEAETT              THEETT
42.  then     TEEENN              EENN
43.  thin     THIINN              THIINN
44.  too      TTOO                TTOO
45.  tree     TTRRIY              TTRRIY
46.  way      WWEI                MMIIIY
47.  well     WWEELL              WWEELL
48.  with     WWIITH              [illegible]
49.  young    YYUHNNGG            IYTHRRNNGG
50.  your     YYUURR              IYUUER
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TABLE  IV 


Recognition 

Statistics 

Phoneme 

Total 

Number 

Total 
Correc  t 

Number 

Added 

Totally 

Missed 

Mis-Ident- 
ified  As 

lY 

10 

10 

- 
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— 

II 

15 

13 
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EE 

13 

12 

- 

- 

II 

AE 

10 

7 

- 

- 

EE  j EE  j EE 

AA 

6 

6 

- 

- 

— 

UH 

10 

6 

- 

1 

RR,  RR,  UU 

UU 

2 

2 

3 

- 

— 

CO 

9 

9 

- 

- 

— 

cv; 

5 

3 

- 

- 

— 

ER 

2 

2 

- 

- 

— 

El 

6 

5 

- 

- 

IIIY 

AI 

6 

6 

- 

- 

— 

01 

0 

- 

- 

- 

— 

OU 

8 

7 

- 

- 

HHCO 

AU 

2 

2 

- 

- 

— 

WW 

9 

5 

- 

2 

MM,  HH 

LL 

13 

13 

- 

- 

— 

RR 

16 

14 

2 

- 

UU,  MM 

YY 

6 

5 

- 

- 

vv 

KM 

10 

9 

- 

- 

NN 

NN 

22 

22 

- 

- 

— 

NG 

0 

« _ _ 
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TABLE  IV(Con't) 


Recognition  Statistics 


Phoneme 

Total 

Number 

Total 
Correc  t 

Number 

Added 

Totally 

Missed 

Mis-Ident- 
ified  As 
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FF 
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- 

- 

— 

TH 
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SS 

15 

15 
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3H 

2 
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— 

CH 

2 
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- 

- 

SH 

JJ 

2 

- 
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- 

DD,  DDHH 

HH 

7 

4 

- 

2 

KK 

BB 

4 

4 

- 

- 

— 

DD 

12 

10 

2 

2 

— 

GG 

4 

3 

- 

1 

— 

PP 

4 

3 

- 

- 

TT 

TT 

25 

21 

- 

2 

— 

KK 

8 

6 

- 

2 

— 

Totals 

281 

245 

11 

13 

23 

speech  pattern  space.  In  order  to  more  easily  visualize  the  problem  of
trajectories  in  the  speech  space,  the  concept  of  formant  targets  must  be
presented.  Every  speech  sound  can  be  thought  to  have  formant  targets 
associated  with  it.  In  the  case  of  voiced  sounds  (vowels,  semi-vowels, 
nasals,  and  voiced  fricatives),  the  formants  actually  migrate  from  the 
previous  sound  to  the  appropriate  targets.  In  the  case  of  a voiced 
sound  followed  by  a fricative  or  stop,  the  formants  move  toward  the 
appropriate  targets  but  voicing  stops  (for  fricatives)  or  the  amplitude 
drops  (for  stops)  prior  to  the  arrival  at  the  targets.  In  the  case  of 
a fricative  or  stop  followed  by  a voiced  sound,  the  formants  move  away 
from  the  target,  toward  the  voiced  sound  but  voicing  starts  or  ampli- 
tude rises  after  the  formants  have  left  the  original  targets.  The  tar- 
gets for  stops  and  fricatives  are  referred  to  as  virtual  targets.  In 
our  synthesis  system  the  formants  move  from  one  target  to  another  in  a 
manner  that  can  be  modeled  as  an  exponential  function.  That  is,  they 
move  rapidly  away  from  the  locus  of  the  previous  sound  but  slow  down  as 
they  approach  the  targets  of  the  succeeding  sound  (see  Appendix  A). 
Figure  30  on  page  84  is  a formant  one  versus  formant  two  plot  of  the
formant  targets  of  the  various  sounds.  Figures  31  and  32  on  pages  85 
and  86  are  similar  plots  for  formant  three  versus  formant  two  and 
formant  three  versus  formant  one,  respectively. 

Several of the system errors noted in Tables II and III on pages 77 through 80 are thought to be a result of a combination of the current algorithms and the trajectories of the sounds in the speech space. (NOTE: in the following cases all words in the CID word lists will be referred to by list number and word number in a shorthand notation. For example, list two word one would be referred to as L2W1.) In ...

... phoneme. Obviously this rule is too simplistic. The segment-by- ...

... L2W8 /chest/ (Fig. 47), L2W16 /gave/, L2W22 /jaw/ (Fig. 51), L2W37 /show/, and L2W49 /young/ (Fig. 61).

Another type of error observed is the nearest neighbor error; each segment of a string is incorrectly identified as a nearby neighbor and the phoneme is consequently identified incorrectly as that neighbor. This error is frequently observed in human listening panels evaluating natural speech. In that case it is not known whether the error is made by the speaker or the listener. However, in our synthesized speech we are quite certain that the proper targets were used in the synthesis strategy. Yet in L1W28 /none/ and L2W49 /young/ (Fig. 61) the sound UH has been identified as RR. It is interesting to note that in both of these cases the vowel is associated with a nasal; in the first case it is surrounded by nasals and in the second it is preceded by a sound somewhat near the nasal NN in the pattern space and followed by NN. Other examples of this type of error are seen in L1W33 /ran/, where AE is identified as EE, again in association with a nasal, and L2W46 /way/, where EI is identified as II. The fact that the system makes errors in these situations, where human observers are also quite likely to make similar errors, lends some credence to the claim that the speech recognition system simulates real auditory system functions.

All algorithms and all pattern characteristics used in developing this system are very general and do not make use of attributes that are unique to speech. It is believed that the cause for this fact is two-fold. First, it may be because the author is an Electrical Engineer and not a specialist in speech production or hearing. Second, it may be because characteristics unique to the speech synthesizer absolutely were not to be used and the author may have gone overboard in this area. There are at least three areas where characteristics unique to speech could probably be used with considerable benefit. First, there could be a voiced/voiceless determination that could at least reduce the number of candidates for a particular sound. This information is readily available in the computer since pitch period marker pulses normally occur during voicing and are absent during voiceless sounds. However, this information, which is considered by phoneticians to be the most basic and simple feature of speech characterization, is not utilized in the present phonemic identification process. Second, the rate, regularity, and amplitude of the first (or last) few pitch periods at the onset (or cessation) of voicing is available information in the ...

At = formant target of current phoneme

Vi = velocity of the formant at ...

For computer simulation of this method, the above equation was Z-transformed using the impulse invariant technique (to preserve the time response to an impulse) to obtain the following difference equation

$$x(nT) = 2k\,x(nT-T) - k^{2}\,x(nT-2T) + (1-k)^{2}\,F(nT-T)$$

where

$T$ = sampling time (PER)
$x(nT)$ = formant position at time $nT$
$k = e^{-T/\tau}$
$\tau$ = time constant in ms

and

$F(nT)$ = formant target at time $nT$
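A short simulation makes the behavior of this recursion easier to see. The sketch below is illustrative only: the function name, the start-up convention (the formant is assumed to be at rest on its starting value before the first sample), and the example numbers are assumptions, not values from the dissertation. The update line is the recursion x(nT) = 2k x(nT-T) - k^2 x(nT-2T) + (1-k)^2 F(nT-T) with k = exp(-T/tau).

```python
import math


def formant_track(targets, tau_ms, T_ms, x0):
    """Run the second-order recursion for one formant.

    targets: target frequency F(nT) for each sample.
    tau_ms:  time constant of the transition in ms.
    T_ms:    sampling interval in ms (must use the same unit as tau_ms).
    x0:      formant value the track starts from (assumed at rest).
    """
    k = math.exp(-T_ms / tau_ms)
    x_prev1 = x_prev2 = x0   # x(nT - T) and x(nT - 2T)
    f_prev = x0              # F(nT - T) before the first sample (assumption)
    track = []
    for f in targets:
        x = 2 * k * x_prev1 - k * k * x_prev2 + (1 - k) ** 2 * f_prev
        track.append(x)
        x_prev2, x_prev1, f_prev = x_prev1, x, f
    return track


# Hypothetical example: a formant moving from 500 Hz to a 1500 Hz target.
track = formant_track([1500.0] * 400, tau_ms=20.0, T_ms=1.0, x0=500.0)
```

Because the coefficients sum to one (2k - k^2 + (1-k)^2 = 1), a constant target is reached exactly in the limit, and the repeated pole at k means the track approaches the target without overshoot, slowing as it gets close.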

Formant data is used to define intrinsic phoneme durations and is the basic mechanism from which all timing is controlled.

Time Constants

Each formant may move from one target to the next at a different rate; thus, a time constant is necessary for each formant in the transition. In this implementation there are 31 basic phonemes, which gives 961 possible combinations. Since three formants are controlled for each phoneme there are 2883 possible time constants. However, by using certain approximations and phoneme groupings the number of time constants was reduced to a more workable 31.
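As a check on these counts, taking the 31 basic phonemes stated in the text and counting every ordered pair of phonemes (a phoneme followed by itself included, which is an assumption about how the combinations were counted):

```latex
\begin{align*}
N^{2}  &= 31 \times 31 = 961  && \text{phoneme-to-phoneme transitions} \\
3N^{2} &= 3 \times 961 = 2883 && \text{one time constant per formant per transition}
\end{align*}
```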


Formant Changes

All three formants may not begin motion toward their new targets simultaneously. In three cases, vowel-stop, vowel-nasal, and consonant-vowel, the initiation of the transition of the first formant is delayed by $\tau_2 - \tau_1$ ms, where $\tau_2$ is the time constant for formant two and $\tau_1$ is the time constant of formant one. This delay serves to emphasize the transition of formants two and three, which are significant for proper perception in these cases.

Nasals

The nasal pole and zero and the bandwidth of formant one are shifted so that they are in position when the amplitude is switched for the nasal and are returned to a nominal value at the end of the nasal. These shifts take about 50 ms. If the nasal is preceded by a voiced sound, the shift of these values can be heard in the voiced branch of the synthesizer because this branch is being excited. This effect is not undesirable because the slight nasalization of the preceding sound is found in natural speech.

Fricatives

The fricative pole and zero move in the same manner as the nasal pole and zero but the movement is not normally heard. If the sound preceding a voiceless fricative is voiced, the pitch of the last [illegible] ms of the phoneme is reduced slightly. This is a clue that a voiceless fricative is coming and is understandable on a physical basis because the vocal cords are stopping. In a voiceless fricative the formant targets are virtual targets and are used only for controlling the transition to and from a voiced sound and for timing. They are not excited during the fricative.

In  a voiced  fricative,  on  the  other  hand,  the  formants  are  excited 
and  when  the  amplitude  of  the  output  of  the  formant  one  pole  exceeds  a 
given  threshold  the  fricative  branch  is  enabled.  The  output  of  the  two 
branches  are  summed  and  each  pitch-period  of  the  output  of  the 
synthesizer  looks  like  a dampened  sine-wave  with  noise  added  above  a 
certain  amplitude. 

Stops 

All stops are characterized by a rapid shut-down of the volume of the preceding phoneme and a period of silence of about 100 ms. The release of the stop, however, is determined by whether the stop is voiced or voiceless. The voiced stop has a rapid release of voicing of the following phoneme and a slight overshoot (~20%) of the volume. A voiceless stop has a short-duration burst of fricative noise and a period of aspiration (~40 ms) followed by the onset of voicing. The amplitude of the voicing is rapidly increased to the value of the succeeding phoneme.

When a vowel is the first sound in an utterance, the speaker performs a "glottal stop." That is, a rapid onset of voicing very similar to the release of a voiced stop. For example, the word /ate/ in initial position differs from /gate/, /bait/, or /date/ only in that the point of release is the glottis. This synthesis scheme incorporates the glottal stop.


Aspirant H and Whispering

In American English the aspirant H is always followed by a vowel or W. In this simulation H is generated by forming and lengthening the succeeding sound and aspirating the first part of it. Although the duration of the aspiration is context dependent, an average value of 100 ms was used. Aspiration and whispering are accomplished by driving the voiced path with the noise source (negative Av) for the voiced (positive Av) sounds and are easily accomplished in this simulation.

Diphthongs and Affricates

The diphthongs EI, OU, AI, OI, and AU are treated as two-vowel sequences, namely EE→II for EI, OW→UU for OU, AA→II for AI, OW→II for OI, and AA→OO for AU. The input diphthongs are automatically replaced in the input string with the appropriate two vowels and the length of each vowel is reduced by 20 ms.

The affricates, CH and J, have a low frequency of occurrence in American speech. CH appears only roughly 0.4% of the time and J roughly 0.5% of the time (Ref. 2). CH has a stop gap of silence followed by a burst of noise, similar to a T, and has a long period of noise, similar to SH, following the burst. J has a voiced release similar to a D followed by a long period of voiced frication similar to ZH. These two sounds are simulated by treating them as the two-phoneme sequences TTSH and DDZH.

^iipljtj^le  Chang.es 

Amplitude  and  source  characteristics  must  tu'  tuned  to  each  other 
and  t ,1  the  tormant  transitions  or  essential  cues  will  not  be  present  in 

1 11 


5 

I 
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till'  Amp  1 1 1 lull’  sluniUl  bi\v,in  .itti'i  t lu-  U’lm.ints  bfi’in 

moviUi’,  tviw.it.l  t h<-  lU’w  plu'iU'iHv's  Init  wo  I I boloio  t ho  movomout  is  oompU’to. 
rho  .r.'.ioimt  ot  viol.iv  p.io.itlv  .it  loots  tho  .imoiuit  ot  t t .m.-.  1 1 i on  tli.it  is 
ho.ii.i  .iiui  , thorotoro,  tlio  i 0000,11 1 .Mb  i 1 i t v ot  t lio  plionomo.  bor  ox.iiiplo, 
ill  ;i  stop-vowol  t inns  1 1 ion , it  tiu'  .implitiulo  oh. 1110,0  is  tiiino.l  on  t .'o 
l.ito  or  too  .slowly  tho  tr.insition  is  K'st  .uui  tho  stop  will  not  bo  oor- 
roo  t 1 y porooivoh.  In  n vowol-stop  I r.ins  i t i >'ii , on  tlu'  othoi  h.in.i , it 

tho  .imp  I 1 1 uvlo  i.s  roiinoo.l  t oi>  ipiioklv  tho  t r.iiis  i t 1011  is  likowiso  lost, 
thus  m.ikino,  it  hittioiilt  to  roooyii  i .'.o  tho  stop.  In  .ill  onsos  loopi  i i 1 no. 

,i  oh. 1110,0  ill  soiiroo  oh.i  r.io  t or  i s t i os  tho  t imo  ot  sonroo  oh. 1110,0  is  b.isi'.l 

on  tho  t imo  .iiui  r.ito  ot  tr.insition  ot  lorm.int  0110. 

Kor  oonsoiuin  t -vowo  1 tr.insitions  .1  l.ir,o,o  po  rooiit  .i;',o  ot  tho  torm.iiit 
tr.insitions  shouKl  ooour  .ittor  tho  souroo  oh.ir.iot  or  i st  ios  .iro  oh.inpivi, 
Ttuis,  in  this  implomont  .it  ion , tho  switoli  t.ikos  pi. 100  il  ms  t.ibont  .U'"„ 
ot  tht'  tr.insition  h.is  t.ikon  pl.iool  .ittor  tr.insition  boyiiis.  For  v.'wo  1 - 
ooiisoii.int  tr.insitions  .i,o,,iin  il  ms  is  nsi'.l  c’xoopt  whiMi  tho  oonsoii.int  is 
n stop.  Ill  tli.it  o.isi'  tho  sonroo  oh.ir.io  t o r 1 s t i o is  oh.inyoh  1 . b i 1 ms 

i.iboiit  4b",.1  nftor  tiirm.iiit  tr.insition  l'o,o,ins.  1 . t 1 ms  is  nso.l  in 

oonsoii.int -oono.on.int  tr.insitions  wlu'ii  .1  stop  is  in  tho  sov-v'ii.l  p.'sitioii; 
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Wutiovit  pits'll  ino  J.u  1 .1 1 1 oil  lulos  t ii<'  s vu  l hi- s i .-.i' r spi-.iks  in  n tint 
lUinuH  I'lit- . I'tu-  I'l'li't  lit  nntnrnl  spi-i-nh  tnsult::  ui  n l.it.'.i'  part  t I'n-.u  t hi- 
pitnh  in  t 1 t-n  t U'U  wi-  laipi'-.n  nn  tin-  hnsin  Mpni-ih  iiu- ssn>;i-  in  t I nn  t i n-.-,  oni 
t i-i- 1 1 iijts  nhnnt  ihn  nnnti-xt.  ritnh  vnii.itinn  is  not  ossonttnl  to  spoooh 
sviuhosis,  hnt  tho  monot  onv  ot  n monotoiu-  Jotmots  troiu  t ho  iin.i  1 i t v ot 
t ho  synthotio  spoooli.  riioiotoio  wo  liovisoii  somo  sii-.iplo  rnlos  to  pio- 
iluoo  piton  vniintions. 


During a voiced sound the pitch is varied inversely with the value of
formant one over a range of ill to 1 is Hz as formant one varies
between J.’O to TaO Hz. The pitch may also be altered during the last
.'00 ms of an utterance. For a statement the pitch drops by ah Hz in
this period; whereas, for a question, the pitch rises by -ts Hz. The
pitch may be held steady at the end of an utterance when desired, such
as for the recitation of a list of words.
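
The two pitch rules can be sketched as follows. This is a minimal illustration, not the original program: the text says only that pitch varies inversely with formant one, so the linear interpolation, the linear ramp over the final window, the function names, and every numeric argument in the usage lines are assumptions.

```python
def pitch_from_f1(f1_hz, f1_range, pitch_range):
    """Map formant one to pitch inversely (assumed linear).

    f1_range = (f1_low, f1_high); pitch_range = (pitch_low, pitch_high).
    The lowest F1 gives the highest pitch and the highest F1 the lowest.
    """
    f1_low, f1_high = f1_range
    pitch_low, pitch_high = pitch_range
    # Clamp F1 into its working range before interpolating.
    f1 = min(max(f1_hz, f1_low), f1_high)
    frac = (f1 - f1_low) / (f1_high - f1_low)
    return pitch_high - frac * (pitch_high - pitch_low)


def final_contour(base_pitch_hz, time_to_end_ms, window_ms, kind, shift_hz):
    """Pitch inside the final window of an utterance.

    kind: "statement" (pitch falls), "question" (pitch rises),
          "list" (pitch held steady).
    """
    if kind not in ("statement", "question", "list"):
        raise ValueError("unknown utterance kind: %r" % kind)
    if kind == "list" or time_to_end_ms >= window_ms:
        return base_pitch_hz
    # progress runs from 0 at the start of the window to 1 at the end.
    progress = 1.0 - time_to_end_ms / window_ms
    if kind == "statement":
        return base_pitch_hz - progress * shift_hz
    return base_pitch_hz + progress * shift_hz


# Placeholder values, chosen only to exercise the functions.
print(pitch_from_f1(500.0, (250.0, 750.0), (100.0, 120.0)))
print(final_contour(100.0, 0.0, 200.0, "question", 30.0))
```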

Stress

When a vowel is stressed in natural speech three things happen: the
pitch rises, the amplitude increases, and the phoneme is lengthened.
All three effects have been incorporated in this simulation. The pitch
and amplitude are both raised by about JOl, for the duration of the
vowel. The amount the vowel is lengthened is a function of the
following phoneme. In a study by House (Ref. ' ^) it was found that the
longest stressed vowels are followed by voiced sounds. House's
experiments were for isolated words and the results were found to be
unacceptable for connected speech. Therefore, following Rabiner's lead,
all durations were reduced by 100 ms.
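
A minimal sketch of the stress rule, under stated assumptions: the duration table below is a placeholder (the isolated-word figures are not reproduced in this section), the class labels and function name are hypothetical, and the fractional raise is passed in rather than fixed. Only the 100 ms reduction is taken from the text.

```python
# Placeholder isolated-word durations (ms) for a stressed vowel, keyed by
# the class of the following phoneme. Illustrative numbers only; they
# merely respect the finding that voiced followers give longer vowels.
ISOLATED_WORD_DURATION_MS = {"voiced": 300.0, "voiceless": 200.0}

# From the text: durations were reduced by 100 ms for connected speech.
CONNECTED_SPEECH_REDUCTION_MS = 100.0


def stressed_vowel(pitch_hz, amplitude, following_class, raise_fraction):
    """Return (pitch, amplitude, duration_ms) for a stressed vowel.

    Pitch and amplitude are raised by the same fraction for the whole
    vowel; duration depends on the class of the following phoneme.
    """
    scale = 1.0 + raise_fraction
    duration_ms = (ISOLATED_WORD_DURATION_MS[following_class]
                   - CONNECTED_SPEECH_REDUCTION_MS)
    return pitch_hz * scale, amplitude * scale, duration_ms
```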
Contextual

The context in which a phoneme appears may have an effect on the
way it is uttered. The areas of contextual effects which were
studied are word boundaries, initiation and shut-down, and certain
phoneme combinations.

Word Boundaries. Word boundaries have only a minimal effect on
connected

An K in initial position is changed to the allophone KO.

Initiations and Shut-Downs. When a phoneme is at the beginning or
end of an utterance it is handled differently than if it is internal to
the utterance. An initial vowel has a glottal stop for a beginning
followed by about 50 ms of steady state. An initial stop is, of course,
preceded by a period of silence so the stop begins with the release.
All other phonemes begin with the formants at steady state and last for
approximately 50 ms.
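
The utterance-initial rules amount to a three-way case split, sketched below. The segment labels, the class names, and the function name are hypothetical; the steady-state length is a parameter because the figure in the text is approximate.

```python
def initial_segments(phoneme_class, steady_state_ms):
    """Segments generated for the first phoneme of an utterance.

    Each segment is a (label, duration_ms) pair; None means the
    duration is not specified by the rule.
    """
    if phoneme_class == "vowel":
        # Glottal stop onset, then a stretch of steady state.
        return [("glottal_stop", None), ("steady_state", steady_state_ms)]
    if phoneme_class == "stop":
        # The preceding silence serves as the closure, so the stop
        # starts directly with its release.
        return [("release", None)]
    # Every other phoneme starts with the formants at steady state.
    return [("steady_state", steady_state_ms)]
```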

A final vowel is lengthened by about J''''... A final vowel, fricative,
or nasal has a gradual shut-down of source amplitude. In a voiced
fricative the voice source shuts off more rapidly than the noise source
and thus they sound like their fricative counterparts for the last ''0 ms
of the utterance. A final stop has a low level, short I’ll inserted
before the source is shut off.


When a period or comma is encountered the utterance is terminated
as above and a pause, or period of silence, is generated ('-100 ms for
comma and MSO ms for period). After the pause a new utterance is
initiated.
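
The punctuation rule can be sketched as a pass over the input symbols that closes the current utterance at each comma or period and records the pause that follows. The function name, the list-of-strings input format, and the pause lengths in the usage line are assumptions for illustration.

```python
def split_utterances(symbols, comma_pause_ms, period_pause_ms):
    """Split a symbol stream into (phonemes, pause_ms) pairs.

    symbols: list of phoneme symbols with "," and "." as punctuation.
    The pause is the silence generated after the utterance ends.
    """
    pauses = {",": comma_pause_ms, ".": period_pause_ms}
    utterances = []
    current = []
    for sym in symbols:
        if sym in pauses:
            if current:  # ignore punctuation with nothing before it
                utterances.append((current, pauses[sym]))
                current = []
        else:
            current.append(sym)
    if current:
        # Trailing material with no closing punctuation gets no pause.
        utterances.append((current, 0))
    return utterances


# Placeholder pause lengths, used only to show the call.
print(split_utterances(["S", "OW", ",", "M", "OO", "."], 100, 200))
```

Each returned utterance would then be given the initiation and shut-down treatment described above before the pause is generated.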

Phoneme Combinations. When the back vowels, OW, C, OO, are
succeeded by K, TH, S, B, M, V, TK, or Z the second formant of the second
phoneme is increased by 400 Hz.
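
A minimal sketch of this combination rule follows. The symbol sets are transcribed from this copy of the text and may not match the original symbol table exactly (OO in particular is read from a damaged scan); the function name and the choice to pass the shift as a parameter are assumptions.

```python
# Symbols as read from this copy; treat the exact spellings as uncertain.
BACK_VOWELS = {"OW", "C", "OO"}
TRIGGER_CONSONANTS = {"K", "TH", "S", "B", "M", "V", "TK", "Z"}


def adjust_second_formant(first, second, second_f2_hz, shift_hz):
    """Return the second formant target of the second phoneme.

    The target is raised by shift_hz only when a back vowel is
    followed by one of the listed consonants.
    """
    if first in BACK_VOWELS and second in TRIGGER_CONSONANTS:
        return second_f2_hz + shift_hz
    return second_f2_hz
```
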
A phoneme-based automatic speech recognition system was developed
and tested using synthetic speech. The acoustic signal is divided
into short segments for analysis; segments are either a single pitch
period of voiced speech or a 10 ms sample of voiceless speech. These

Each group of segments represents a phoneme and is identified by
simple algorithms operating on the string of phonemically-named
segments that form the group.

The phoneme-based recognition system was tested using isolated
synthesized words, which permitted evaluation with connected strings
of phonemes but stopped short of requiring development of word
boundary rules. The tests consisted of 100 phonemically balanced
words containing 281 phonemes. Of these, 245 were correctly
identified, 23 were mis-identified, 13 were missed entirely, and 11 were
added. However, many of these errors were predictable or
understandable and may be overcome at a higher (word or phrase) level.
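
The test figures can be checked with a few lines of arithmetic. The sketch below uses 245 correct, 23 mis-identified, and 13 missed, which are the counts that sum to the stated total of 281, plus the 11 added phonemes. The function name is hypothetical, and the combined error rate (substitutions, deletions, and insertions over the reference total) is a conventional measure assumed here, not one reported in the text.

```python
def score(total, correct, substituted, deleted, inserted):
    """Summarize recognition counts as percentages of the reference total."""
    # Every reference phoneme is either correct, mis-identified, or missed.
    if correct + substituted + deleted != total:
        raise ValueError("counts do not add up to the reference total")
    return {
        "percent_correct": 100.0 * correct / total,
        "error_rate": 100.0 * (substituted + deleted + inserted) / total,
    }


result = score(total=281, correct=245, substituted=23, deleted=13, inserted=11)
print(round(result["percent_correct"], 1))  # about 87.2
print(round(result["error_rate"], 1))       # about 16.7
```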